
Elasticsearch Mapping: Defining an Index



This article walks through how mappings define an index in Elasticsearch: the built-in meta fields, the core field types, and the array, object, root, nested, ip and geo point types.


Mapping is the process of defining how a document should be mapped to the Search Engine, including its searchable characteristics such as which fields are searchable and if/how they are tokenized. In ElasticSearch, an index may store documents of different "mapping types". ElasticSearch allows one to associate multiple mapping definitions for each mapping type.

Explicit mapping is defined on an index/type level. By default, there isn't a need to define an explicit mapping, since one is automatically created and registered when a new type or new field is introduced (with no performance overhead) and has sensible defaults. Only when the defaults need to be overridden must a mapping definition be provided.

mapping types

Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database. Though there is separation between types, it’s not a full separation (all end up as a document within the same Lucene index).

Fields with the same name across types are highly recommended to have the same type and the same mapping characteristics (analysis settings, for example). There is an effort to allow explicitly "choosing" which field to use by using a type prefix (my_type.my_field), but it's not complete, and there are places where it will never work (like faceting on the field).

In practice though, this restriction is almost never an issue. The field name usually ends up being a good indication to its "typeness" (e.g. "first_name" will always be a string). Note also, that this does not apply to the cross index case.

global settings

The index.mapping.ignore_malformed global setting can be set on the index level to allow to ignore malformed content globally across all mapping types (malformed content example is trying to index a text string value as a numeric type).

The index.mapping.coerce global setting can be set on the index level to coerce numeric content globally across all mapping types (The default setting is true and coercions attempted are to convert strings with numbers into numeric types and also numeric values with fractions to any integer/short/long values minus the fraction part). When the permitted conversions fail in their attempts, the value is considered malformed and the ignore_malformed setting dictates what will happen next.
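For example, both settings could be supplied in the settings section of a create index request; a minimal sketch (the values shown are illustrative):

{
    "settings":{
        "index.mapping.ignore_malformed":true,
        "index.mapping.coerce":false
    }
}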


 

Fields

1)_uid

Each document indexed is associated with an id and a type, the internal _uid field is the unique identifier of a document within an index and is composed of the type and the id (meaning that different types can have the same id and still maintain uniqueness).

The _uid field is automatically used when _type is not indexed to perform type based filtering, and does not require the _id to be indexed.

【_uid = type + id, so different types may contain documents with the same id.】
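As a sketch of how this composition can be used directly, here is a term query against _uid, assuming the internal type#id form (a tweet with id 1):

{
    "query":{
        "term":{"_uid":"tweet#1"}
    }
}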

2)_id

Each document indexed is associated with an id and a type. The _id field can be used to index just the id, and possibly also store it. By default it is not indexed and not stored (thus, not created).

Note, even though the _id is not indexed, all the APIs still work (since they work with the _uid field), as well as fetching by ids using term, terms or prefix queries/filters (including the specific ids query/filter).

【_id is neither indexed nor stored by default; lookups against it are served through the _uid field.】
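For example, a minimal sketch of fetching documents by id through the ids query (the type and values are illustrative):

{
    "query":{
        "ids":{
            "type":"tweet",
            "values":["1","4","100"]
        }
    }
}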

The _id field can be enabled to be indexed, and possibly stored, using:

{
    "tweet":{
        "_id":{"index":"not_analyzed","store":false}
    }
}

The _id mapping can also be associated with a path that will be used to extract the id from a different location in the source document. For example, having the following mapping:

{
    "tweet":{
        "_id":{
            "path":"post_id"
        }
    }
}

Will cause 1 to be used as the id for:

{
    "message":"You know, for Search",
    "post_id":"1"
}

This does require an additional lightweight parsing step while indexing, in order to extract the id to decide which shard the index operation will be executed on.

3)_type

Each document indexed is associated with an id and a type. The type, when indexing, is automatically indexed into a _type field. By default, the _type field is indexed (but not analyzed) and not stored. This means that the _type field can be queried.

【Since a _type field is indexed for each document, does every type-scoped search need an explicit _type condition?】
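In practice it does not: searching against a specific type endpoint applies the type filter automatically. The _type field can still be queried explicitly when needed; a minimal sketch:

{
    "query":{
        "term":{"_type":"tweet"}
    }
}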

The _type field can be stored as well, for example:

{
    "tweet":{
        "_type":{"store":true}
    }
}

The _type field can also not be indexed, and all the APIs will still work except for specific queries (term queries / filters) or faceting done on the _type field.

{
    "tweet":{
        "_type":{"index":"no"}
    }
}

4)_source

The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. It is not indexed (searchable), just stored. When executing "fetch" requests, like get or search, the _source field is returned by default.

【Think of _source as the stored forward text; when enabled, this field takes up a fair amount of extra space.】

Though very handy to have around, the source field does incur storage overhead within the index. For this reason, it can be disabled. For example:

{
    "tweet":{
        "_source":{"enabled":false}
    }
}

includes / excludes

Allow to specify paths in the source that would be included / excluded when it’s stored, supporting * as wildcard annotation. For example:

{
    "my_type":{
        "_source":{
            "includes":["path1.*","path2.*"],
            "excludes":["pat3.*"]
        }
    }
}

5)_all

The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size.

The _all fields can be completely disabled. Explicit field mappings and object mappings can be excluded / included in the _all field. By default, it is enabled and all fields are included in it for ease of use.

When disabling the _all field, it is a good practice to set index.query.default_field to a different value (for example, if you have a main "message" field in your data, set it to message).

【When the _all field is disabled, the best practice is to point index.query.default_field at a sensible default search field.】

One of the nice features of the _all field is that it takes into account specific fields boost levels. Meaning that if a title field is boosted more than content, the title (part) in the _all field will mean more than the content (part) in the _all field.

Here is a sample mapping:

{
    "person":{
        "_all":{"enabled":true},
        "properties":{
            "name":{
                "type":"object",
                "dynamic":false,
                "properties":{
                    "first":{"type":"string","store":true,"include_in_all":false},
                    "last":{"type":"string","index":"not_analyzed"}
                }
            },
            "address":{
                "type":"object",
                "include_in_all":false,
                "properties":{
                    "first":{
                        "properties":{
                            "location":{"type":"string","store":true,"index_name":"firstLocation"}
                        }
                    },
                    "last":{
                        "properties":{
                            "location":{"type":"string"}
                        }
                    }
                }
            },
            "simple1":{"type":"long","include_in_all":true},
            "simple2":{"type":"long","include_in_all":false}
        }
    }
}

The _all field allows for store, term_vector and analyzer (with specific index_analyzer and search_analyzer) to be set.

highlighting

For any field to allow highlighting it has to be either stored or part of the _source field. By default the _all field does not qualify for either, so highlighting for it does not yield any data.

Although it is possible to store the _all field, it is basically an aggregation of all fields, which means more data will be stored, and highlighting it might produce strange results.

6)_analyzer

The _analyzer mapping allows to use a document field property as the name of the analyzer that will be used to index the document. The analyzer will be used for any field that does not explicitly define an analyzer or index_analyzer when indexing.

Here is a simple mapping:

{
    "type1":{
        "_analyzer":{
            "path":"my_field"
        }
    }
}

The above will use the value of the my_field field to look up an analyzer registered under that name. For example, indexing the following doc:

{
    "my_field":"whitespace"
}

Will cause the whitespace analyzer to be used as the index analyzer for all fields without explicit analyzer setting.

The default path value is _analyzer, so the analyzer can be driven for a specific document by setting the _analyzer field in it. If a custom json field name is needed, an explicit mapping with a different path should be set.

By default, the _analyzer field is indexed; it can be disabled by setting index to no in the mapping.

7)_boost

Boosting is the process of enhancing the relevancy of a document or field. Field level mapping allows to define an explicit boost level on a specific field. The boost field mapping (applied on the root object) allows to define a boost field mapping where its content will control the boost level of the document. For example, consider the following mapping:

{
    "tweet":{
        "_boost":{"name":"my_boost","null_value":1.0}
    }
}

The above mapping defines a mapping for a field named my_boost. If the my_boost field exists within the JSON document indexed, its value will control the boost level of the document indexed. For example, the following JSON document will be indexed with a boost value of 2.2:

{
    "my_boost":2.2,
    "message":"This is a tweet!"
}

function score instead of boost

Support for document boosting via the _boost field has been removed from Lucene and is deprecated in Elasticsearch as of v1.0.0.RC1. The implementation in Lucene resulted in unpredictable results when used with multiple fields or multi-value fields.

Instead, the Function Score Query can be used to achieve the desired functionality by boosting each document by the value in any field of the document:

{
    "query":{
        "function_score":{
            "query":{  
                "match":{
                    "title":"your main query"
                }
            },
            "functions":[{
                "script_score":{
                    "script":"doc[''my_boost_field''].value"
                }
            }],
            "score_mode":"multiply"
        }
    }
}

8)_parent

The parent field mapping is defined on a child mapping, and points to the parent type this child relates to. For example, in case of a blog type and a blog_tag type child document, the mapping for blog_tag should be:

{
    "blog_tag":{
        "_parent":{
            "type":"blog"
        }
    }
}

The mapping is automatically stored and indexed (meaning it can be searched on using the _parent field notation).
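Since the _parent value is indexed, the children of a given parent can be fetched with a term query on it; a hedged sketch (whether the raw parent id or a type#id form is expected can vary by version):

{
    "query":{
        "term":{"_parent":"1122"}
    }
}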

9)_routing

The routing field allows to control the _routing aspect when indexing data and explicit routing control is required.

store / index

The first thing the _routing mapping does is to store the routing value provided (store set to true) and index it (index set to not_analyzed). The reason why the routing is stored by default is so reindexing data will be possible if the routing value is completely external and not part of the docs.

required

Another aspect of the _routing mapping is the ability to define it as required by setting required to true. This is very important to set when using routing features, as it allows different APIs to make use of it. For example, an index operation will be rejected if no routing value has been provided (or derived from the doc). A delete operation will be broadcast to all shards if no routing value is provided and _routing is required.

path

The routing value can be provided as an external value when indexing (and still stored as part of the document, in much the same way _source is stored). But, it can also be automatically extracted from the index doc based on a path. For example, having the following mapping:

{
    "comment":{
        "_routing":{
            "required":true,
            "path":"blog.post_id"
        }
    }
}

Will cause the following doc to be routed based on the 111222 value:

{
    "text":"the comment text",
    "blog":{
        "post_id":"111222"
    }
}

Note, using path without an explicit routing value provided requires an additional (though quite fast) parsing phase.

id uniqueness

When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed throughout all the shards that the index is composed of. In fact, documents with the same _id might end up in different shards if indexed with different _routing values.

10)_index

The ability to store in a document the index it belongs to. By default it is disabled, in order to enable it, the following mapping should be defined:

{
    "tweet":{
        "_index":{"enabled":true}
    }
}

11)_size

The _size field allows to automatically index the size of the original _source indexed. By default, it’s disabled. In order to enable it, set the mapping to:

【_size indexes the byte size of the original _source; it does not limit it.】

{
    "tweet":{
        "_size":{"enabled":true}
    }
}

In order to also store it, use:

{
    "tweet":{
        "_size":{"enabled":true,"store":true}
    }
}

12)_timestamp

The _timestamp field allows to automatically index the timestamp of a document. It can be provided externally via the index request or in the _source. If it is not provided externally it will be automatically set to the date the document was processed by the indexing chain.

【Timestamp: if none is provided externally, one is generated automatically.】

enabled

By default it is disabled. In order to enable it, the following mapping should be defined:

{
    "tweet":{
        "_timestamp":{"enabled":true}
    }
}

store / index

By default the _timestamp field has store set to false and index set to not_analyzed. It can be queried as a standard date field.
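For example, a minimal sketch of a range query against _timestamp (the bounds are illustrative):

{
    "query":{
        "range":{
            "_timestamp":{
                "gte":"2009-11-15T00:00:00",
                "lt":"2009-11-16T00:00:00"
            }
        }
    }
}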

path

The _timestamp value can be provided as an external value when indexing. But, it can also be automatically extracted from the document to index based on a path. For example, having the following mapping:

{
    "tweet":{
        "_timestamp":{
            "enabled":true,
            "path":"post_date"
        }
    }
}

Will cause 2009-11-15T14:12:12 to be used as the timestamp value for:

{
    "message":"You know, for Search",
    "post_date":"2009-11-15T14:12:12"
}

Note, using path without an explicit timestamp value provided requires an additional (though quite fast) parsing phase.

format

You can define the date format used to parse the provided timestamp value. For example:

{
    "tweet":{
        "_timestamp":{
            "enabled":true,
            "path":"post_date",
            "format":"YYYY-MM-dd"
        }
    }
}

Note, the default format is dateOptionalTime. The timestamp value will first be parsed as a number and if it fails the format will be tried.

13)_ttl

A lot of documents naturally come with an expiration date. Documents can therefore have a _ttl (time to live), which will cause the expired documents to be deleted automatically.

【ttl - time to live! 可以用来设置文档的过期时间。 】

enabled

By default it is disabled, in order to enable it, the following mapping should be defined:

{
    "tweet":{
        "_ttl":{"enabled":true}
    }
}

store / index

By default the _ttl field has store set to true and index set to not_analyzed. Note that the index property has to be set to not_analyzed in order for the purge process to work.

default

You can provide a per index/type default _ttl value as follows:

{
    "tweet":{
        "_ttl":{"enabled":true,"default":"1d"}
    }
}

In this case, if you don't provide a _ttl value in your index request or in the _source, all tweets will have a _ttl of one day.

In case you do not specify a time unit like d (days), m (minutes), h (hours), ms (milliseconds) or w (weeks), milliseconds is used as the default unit.

If no default is set and no _ttl value is given then the document has an infinite _ttl and will not expire.

You can dynamically update the default value using the put mapping API. It won’t change the _ttl of already indexed documents but will be used for future documents.

note on documents expiration

Expired documents will be automatically deleted regularly. You can dynamically set the indices.ttl.interval to fit your needs. The default value is 60s.

The deletion orders are processed by bulk. You can set indices.ttl.bulk_size to fit your needs. The default value is 10000.

Note that the expiration procedure handles versioning properly, so if a document is updated between the collection of documents to expire and the delete order, the document won't be deleted.
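Since indices.ttl.interval is dynamic, it could be adjusted through the cluster update settings API; a sketch (the value is illustrative):

{
    "transient":{
        "indices.ttl.interval":"30s"
    }
}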

 


 

Types

1)core types

Each JSON field can be mapped to a specific core type. JSON itself already provides us with some typing, with its support for string, integer/long, float/double, boolean, and null.

The following sample tweet JSON document will be used to explain the core types:

{
    "tweet":{
        "user":"kimchy",
        "message":"This is a tweet!",
        "postDate":"2009-11-15T14:12:12",
        "priority":4,
        "rank":12.3
    }
}

Explicit mapping for the above JSON tweet can be:

{
    "tweet":{
        "properties":{
            "user":{"type":"string","index":"not_analyzed"},
            "message":{"type":"string","null_value":"na"},
            "postDate":{"type":"date"},
            "priority":{"type":"integer"},
            "rank":{"type":"float"}
        }
    }
}

string

The text based string type is the most basic type, and contains one or more characters. An example mapping can be:

{
    "tweet":{
        "properties":{
            "message":{
                "type":"string",
                "store":true,
                "index":"analyzed",
                "null_value":"na"
            },
            "user":{
                "type":"string",
                "index":"not_analyzed",
                "norms":{
                    "enabled":false
                }
            }
        }
    }
}

The above mapping defines a string message property/field within the tweet type. The field is stored in the index (so it can later be retrieved using selective loading when searching), and it gets analyzed (broken down into searchable terms). If the message has a null value, then the value that will be stored is na. There is also a string user which is indexed as-is (not broken down into tokens) and has norms disabled (so that matching this field is a binary decision, no match is better than another one).

The following table lists all the attributes that can be used with the string type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to actually store the field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to analyzed for the field to be indexed and searchable after being broken down into tokens using an analyzer. not_analyzed means that it's still searchable, but does not go through any analysis process nor is it broken down into tokens. no means that it won't be searchable at all (as an individual field; it may still be included in _all). Setting to no disables include_in_all. Defaults to analyzed.

doc_values

Set to true to store field values in a column-stride fashion. Automatically set to true when the fielddata format is doc_values.

term_vector

Possible values are no, yes, with_offsets, with_positions, with_positions_offsets. Defaults to no.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

norms: {enabled: <value>}

Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.

norms: {loading: <value>}

Describes how norms should be loaded, possible values are eager and lazy (default). It is possible to change the default value to eager for all fields by configuring the index setting index.norms.loading to eager.

index_options

Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).

analyzer

The analyzer used to analyze the text contents when analyzed during indexing and when searching using a query string. Defaults to the globally configured analyzer.

index_analyzer

The analyzer used to analyze the text contents when analyzed during indexing.

search_analyzer

The analyzer used to analyze the field when part of a query string. Can be updated on an existing field.

include_in_all

Should the field be included in the _all field (if enabled). If index is set to no this defaults to false, otherwise, defaults to true or to the parent object type setting.

ignore_above

The analyzer will ignore strings larger than this size. Useful for generic not_analyzed fields that should ignore long text.

position_offset_gap

Position increment gap between field instances with the same field name. Defaults to 0.

The string type also supports custom indexing parameters associated with the indexed value. For example:

{
    "message":{
        "_value":  "boosted value",
        "_boost":  2.0
    }
}

The mapping is required to disambiguate the meaning of the document. Otherwise, the structure would interpret "message" as a value of type "object". The key _value (or value) in the inner document specifies the real string content that should eventually be indexed. The _boost (or boost) key specifies the per field document boost (here 2.0).

norms

Norms store various normalization factors that are later used (at query time) in order to compute the score of a document relatively to a query.

Although useful for scoring, norms also require quite a lot of memory (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, it is highly recommended to disable norms on it. In particular, this is the case for fields that are used solely for filtering or aggregations.

Coming in 1.2.0.

In case you would like to disable norms after the fact, it is possible to do so by using the PUT mapping API. Please note, however, that norms won't be removed instantly, but only as your index receives new insertions or updates and segments get merged. Any score computation on a field that had norms removed might return inconsistent results, since some documents won't have norms anymore while other documents might still have them.
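For example, a sketch of a PUT mapping body that disables norms on an existing field after the fact (the tweet type and user field are carried over from the earlier example):

{
    "tweet":{
        "properties":{
            "user":{
                "type":"string",
                "norms":{"enabled":false}
            }
        }
    }
}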

number

A number based type supporting float, double, byte, short, integer, and long. It uses specific constructs within Lucene in order to support numeric values. The number types have the same ranges as corresponding Java types. An example mapping can be:

{
    "tweet":{
        "properties":{
            "rank":{
                "type":"float",
                "null_value":1.0
            }
        }
    }
}

The following table lists all the attributes that can be used with a number type:

Attribute Description

type

The type of the number. Can be float, double, integer, long, short or byte. Required.

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. Setting to no disables include_in_all. If set to no the field should be either stored in _source, have include_in_all enabled, or store be set to true for this to be useful.

doc_values

Set to true to store field values in a column-stride fashion. Automatically set to true when the fielddata format is doc_values.

precision_step

The precision step (number of terms generated for each number value). Defaults to 4.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

include_in_all

Should the field be included in the _all field (if enabled). If index is set to no this defaults to false, otherwise, defaults to true or to the parent object type setting.

ignore_malformed

Ignores a malformed number. Defaults to false.

coerce

Try to convert strings to numbers and truncate fractions for integers. Defaults to true.

token count

The token_count type maps to the JSON string type but indexes and stores the number of tokens in the string rather than the string itself. For example:

{
    "tweet":{
        "properties":{
            "name":{
                "type":"string",
                "fields":{
                    "word_count":{
                        "type":"token_count",
                        "store":"yes",
                        "analyzer":"standard"
                    }
                }
            }
        }
    }
}

All the configuration that can be specified for a number can be specified for a token_count. The only extra configuration is the required analyzer field which specifies which analyzer to use to break the string into tokens. For best performance, use an analyzer with no token filters.

Technically the token_count type sums position increments rather than counting tokens. This means that even if the analyzer filters out stop words they are included in the count.

date

The date type is a special type which maps to JSON string type. It follows a specific format that can be explicitly set. All dates are UTC. Internally, a date maps to a number type long, with the added parsing stage from string to long and from long to string. An example mapping:

{
    "tweet":{
        "properties":{
            "postDate":{
                "type":"date",
                "format":"YYYY-MM-dd"
            }
        }
    }
}

The date type will also accept a long number representing UTC milliseconds since the epoch, regardless of the format it can handle.

The following table lists all the attributes that can be used with a date type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

format

The date format. Defaults to dateOptionalTime.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. Setting to no disables include_in_all. If set to no the field should be either stored in _source, have include_in_all enabled, or store be set to true for this to be useful.

doc_values

Set to true to store field values in a column-stride fashion. Automatically set to true when the fielddata format is doc_values.

precision_step

The precision step (number of terms generated for each number value). Defaults to 4.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

include_in_all

Should the field be included in the _all field (if enabled). If index is set to no this defaults to false, otherwise, defaults to true or to the parent object type setting.

ignore_malformed

Ignores a malformed number. Defaults to false.

boolean

The boolean type maps to the JSON boolean type. It ends up storing within the index either T or F, with automatic translation to true and false respectively.

{
    "tweet":{
        "properties":{
            "hes_my_special_tweet":{
                "type":"boolean"
            }
        }
    }
}

The boolean type also supports passing the value as a number or a string (in this case 0, an empty string, F, false, off and no are false, all other values are true).

The following table lists all the attributes that can be used with the boolean type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. Setting to no disables include_in_all. If set to no the field should be either stored in _source, have include_in_all enabled, or store be set to true for this to be useful.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

binary

The binary type is a base64 representation of binary data that can be stored in the index. The field is not stored by default and not indexed at all.

{
    "tweet":{
        "properties":{
            "image":{
                "type":"binary"
            }
        }
    }
}

The following table lists all the attributes that can be used with the binary type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

fielddata filters

It is possible to control which field values are loaded into memory, which is particularly useful for faceting on string fields, using fielddata filters, which are explained in detail in the Fielddata section.

Fielddata filters can exclude terms which do not match a regex, or which don't fall between a min and max frequency range:

{
    "tweet":{
        "type":"string",
        "analyzer":"whitespace",
        "fielddata":{
            "filter":{
                "regex":{
                    "pattern":"^#.*"
                },
                "frequency":{
                    "min":0.001,
                    "max":0.1,
                    "min_segment_size":500
                }
            }
        }
    }
}

These filters can be updated on an existing field mapping and will take effect the next time the fielddata for a segment is loaded. Use the Clear Cache API to reload the fielddata using the new filters.

postings format

Postings formats define how fields are written into the index and how fields are represented in memory. Postings formats can be defined per field via the postings_format option. Postings formats are configurable. Elasticsearch has several builtin formats:

direct

A postings format that uses disk-based storage but loads its terms and postings directly into memory. Note this postings format is very memory intensive and has certain limitations that don't allow segments to grow beyond 2.1GB; see {@link DirectPostingsFormat} for details.

memory

A postings format that stores its entire terms, postings, positions and payloads in a finite state transducer. This format should only be used for primary keys or with fields where each term is contained in a very low number of documents.

pulsing

A postings format that in-lines the posting lists for very low-frequency terms in the term dictionary. This is useful to improve lookup performance for low-frequency terms.

bloom_default

A postings format that uses a bloom filter to improve term lookup performance. This is useful for primary keys or fields that are used as a delete key.

bloom_pulsing

A postings format that combines the advantages of bloom and pulsing to further improve lookup performance.

default

The default Elasticsearch postings format offering best general purpose performance. This format is used if no postings format is specified in the field mapping.

postings format example

On all field types it is possible to configure a postings_format attribute:

{
  "person":{
     "properties":{
         "second_person_id":{"type":"string","postings_format":"pulsing"}
     }
  }
}

On top of using the built-in postings formats it is possible to define custom postings formats. See the codec module for more information.

doc values format

Doc values formats define how fields are written into column-stride storage in the index for the purpose of sorting or faceting. Fields that have doc values enabled will have special field data instances, which will not be uninverted from the inverted index, but directly read from disk. This makes _refresh faster and ultimately allows for having field data stored on disk depending on the configured doc values format.

Doc values formats are configurable. Elasticsearch has several builtin formats:

memory

A doc values format which stores data in memory. Compared to the default field data implementations, using doc values with this format will have similar performance but will be faster to load, making _refresh less time-consuming.

disk

A doc values format which stores all data on disk, requiring almost no memory from the JVM at the cost of a slight performance degradation.

default

The default Elasticsearch doc values format, offering good performance with low memory usage. This format is used if no format is specified in the field mapping.

doc values format example

On all field types, it is possible to configure a doc_values_format attribute:

{
  "product":{
     "properties":{
         "price":{"type":"integer","doc_values_format":"memory"}
     }
  }
}

On top of using the built-in doc values formats it is possible to define custom doc values formats. See the codec module for more information.

similarity

Elasticsearch allows you to configure a similarity (scoring algorithm) per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default TF/IDF, such as BM25.

You can configure similarities via the similarity module.

configuring similarity per field

Defining the Similarity for a field is done via the similarity mapping property, as this example shows:

{
  "book":{
    "properties":{
      "title":{"type":"string","similarity":"BM25"}
    }
  }
}

The following Similarities are configured out-of-box:

default

The Default TF/IDF algorithm used by Elasticsearch and Lucene in previous versions.

BM25

The BM25 algorithm. See Okapi_BM25 for more details.

copy to field

Added in 1.0.0.RC2.

Adding the copy_to parameter to any field mapping will cause all values of this field to be copied to the fields specified in the parameter. In the following example all values from the fields title and abstract will be copied to the field meta_data.

{
  "book":{
    "properties":{
      "title":{"type":"string","copy_to":"meta_data"},
      "abstract":{"type":"string","copy_to":"meta_data"},
      "meta_data":{"type":"string"}
    }
  }
}

Multiple fields are also supported:

{
  "book":{
    "properties":{
      "title":{"type":"string","copy_to":["meta_data","article_info"]}
    }
  }
}

multi fields

Added in 1.0.0.RC1.

The fields option allows mapping several core type fields onto a single JSON source field. This can be useful if a single field needs to be used in different ways. For example, a single field is to be used for both free text search and sorting.

{
  "tweet":{
    "properties":{
      "name":{
        "type":"string",
        "index":"analyzed",
        "fields":{
          "raw":{"type":"string","index":"not_analyzed"}
        }
      }
    }
  }
}

In the above example the field name gets processed twice. The first time it gets processed as an analyzed string, and this version is accessible under the field name name; this is the main field and is in fact just like any other field. The second time it gets processed as a not analyzed string and is accessible under the name name.raw.
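For example, a minimal search sketch that queries the analyzed main field while sorting on the not analyzed name.raw variant:

{
    "query":{
        "match":{"name":"elasticsearch"}
    },
    "sort":[
        {"name.raw":"asc"}
    ]
}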

include in all

The include_in_all setting is ignored on any field that is defined in the fields options. Setting include_in_all only makes sense on the main field, since it is the raw field value that is copied to the _all field; the tokens aren't copied.

updating a field

In essence, a field can't be updated. However, multi fields can be added to existing fields. This allows, for example, having a different index_analyzer configuration in addition to the one already configured in the main and other multi fields.

Also, the new multi field will only be applied to documents that have been added after the multi field itself was added; the new multi field doesn't exist in previously indexed documents.

Another important note is that new multi fields will be merged into the list of existing multi fields, so when adding new multi fields for a field, previously added multi fields don't need to be specified.

accessing fields

deprecated in 1.0.0.

Use copy_to instead.

The multi fields defined in the fields are prefixed with the name of the main field and can be accessed by their full path using the navigation notation: name.raw, or using the typed navigation notation tweet.name.raw. The path option allows to control how fields are accessed. If the path option is set to full, then the full path of the main field is prefixed, but if the path option is set to just_name the actual multi field name without any prefix is used. The default value for the path option is full.

The just_name setting, among other things, allows indexing content of multiple fields under the same name. In the example below the content of both fields first_name and last_name can be accessed by using any_name or tweet.any_name.

{
  "tweet":{
    "properties":{
      "first_name":{
        "type":"string",
        "index":"analyzed",
        "path":"just_name",
        "fields":{
          "any_name":{"type":"string","index":"analyzed"}
        }
      },
      "last_name":{
        "type":"string",
        "index":"analyzed",
        "path":"just_name",
        "fields":{
          "any_name":{"type":"string","index":"analyzed"}
        }
      }
    }
  }
}

 

2)array type

JSON documents allow to define an array (list) of fields or objects. Mapping array types could not be simpler, since arrays get automatically detected and mapping them can be done with either Core Types or Object Type mappings. For example, the following JSON defines several arrays:

{
    "tweet":{
        "message":"some arrays in this tweet...",
        "tags":["elasticsearch","wow"],
        "lists":[
            {
                "name":"prog_list",
                "description":"programming list"
            },
            {
                "name":"cool_list",
                "description":"cool stuff list"
            }
        ]
    }
}

The above JSON has the tags property defining a list of a simple string type, and the lists property is an object type array. Here is a sample explicit mapping:

{
    "tweet":{
        "properties":{
            "message":{"type":"string"},
            "tags":{"type":"string","index_name":"tag"},
            "lists":{
                "properties":{
                    "name":{"type":"string"},
                    "description":{"type":"string"}
                }
            }
        }
    }
}

The fact that array types are automatically supported can be shown by the fact that the following JSON document is perfectly fine:

{
    "tweet":{
        "message":"some arrays in this tweet...",
        "tags":"elasticsearch",
        "lists":{
            "name":"prog_list",
            "description":"programming list"
        }
    }
}

Note also, that thanks to the fact that we used the index_name to use the non-plural form (tag instead of tags), we can actually refer to the field using the index_name as well. For example, we can execute a query using tweet.tags:wow or tweet.tag:wow. We could, of course, name the field tag and skip the index_name altogether.

3)object type

JSON documents are hierarchical in nature, allowing them to define inner "objects" within the actual JSON. Elasticsearch completely understands the nature of these inner objects and can map them easily, providing query support for their inner fields. Because each document can have objects with different fields each time, objects mapped this way are known as "dynamic". Dynamic mapping is enabled by default. Let’s take the following JSON as an example:

{
    "tweet":{
        "person":{
            "name":{
                "first_name":"Shay",
                "last_name":"Banon"
            },
            "sid":"12345"
        },
        "message":"This is a tweet!"
    }
}

The above shows an example where a tweet includes the actual person details. A person is an object, with a sid, and a name object which has first_name and last_name. It’s important to note that tweet is also an object, although it is a special root object type which allows for additional mapping definitions.

The following is an example of explicit mapping for the above JSON:

{
    "tweet":{
        "properties":{
            "person":{
                "type":"object",
                "properties":{
                    "name":{
                        "properties":{
                            "first_name":{"type":"string"},
                            "last_name":{"type":"string"}
                        }
                    },
                    "sid":{"type":"string","index":"not_analyzed"}
                }
            },
            "message":{"type":"string"}
        }
    }
}

In order to mark a mapping of type object, set the type to object. This is an optional step, since if there are properties defined for it, it will automatically be identified as an object mapping.

properties

An object mapping can optionally define one or more properties using the properties tag for a field. Each property can be either another object, or one of the core_types.

dynamic

One of the most important features of Elasticsearch is its ability to be schema-less. This means that, in our example above, the person object can be indexed later with a new property (age, for example) and it will automatically be added to the mapping definitions. Same goes for the tweet root object.

This feature is by default turned on, and it’s the dynamic nature of each object mapped. Each object mapped is automatically dynamic, though it can be explicitly turned off:

{
    "tweet":{
        "properties":{
            "person":{
                "type":"object",
                "properties":{
                    "name":{
                        "dynamic":false,
                        "properties":{
                            "first_name":{"type":"string"},
                            "last_name":{"type":"string"}
                        }
                    },
                    "sid":{"type":"string","index":"not_analyzed"}
                }
            },
            "message":{"type":"string"}
        }
    }
}

In the above example, the name object mapped is not dynamic, meaning that if, in the future, we try to index JSON with a middle_name within the name object, it will get discarded and not added.

There is no performance overhead if an object is dynamic, the ability to turn it off is provided as a safety mechanism so "malformed" objects won’t, by mistake, index data that we do not wish to be indexed.

If a dynamic object contains yet another inner object, it will be automatically added to the index and mapped as well.

When processing dynamic new fields, their type is automatically derived. For example, if it is a number, it will automatically be treated as number core_type. Dynamic fields default to their default attributes, for example, they are not stored and they are always indexed.

Date fields are special since they are represented as a string. Date fields are detected if they can be parsed as a date when they are first introduced into the system. The set of date formats that are tested against can be configured using the dynamic_date_formats on the root object, which is explained later.

Note, once a field has been added, its type can not change. For example, if we added age and its value is a number, then it can’t be treated as a string.

The dynamic parameter can also be set to strict, meaning that not only will new fields not be introduced into the mapping, but also that parsing (indexing) docs with such new fields will fail.
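For example, a sketch of a mapping with dynamic set to strict, under which indexing a doc that introduces any unmapped field will be rejected:

{
    "tweet":{
        "dynamic":"strict",
        "properties":{
            "message":{"type":"string"}
        }
    }
}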

enabled

The enabled flag allows to disable parsing and indexing a named object completely. This is handy when a portion of the JSON document contains arbitrary JSON which should not be indexed, nor added to the mapping. For example:

{
    "tweet":{
        "properties":{
            "person":{
                "type":"object",
                "properties":{
                    "name":{
                        "type":"object",
                        "enabled":false
                    },
                    "sid":{"type":"string","index":"not_analyzed"}
                }
            },
            "message":{"type":"string"}
        }
    }
}

In the above, name and its content will not be indexed at all.

include_in_all

include_in_all can be set on the object type level. When set, it propagates down to all the inner mappings defined within the object that do not explicitly set it.

path

deprecated in 1.0.0.

Use copy_to instead.

In the core_types section, a field can have an index_name associated with it in order to control the name of the field that will be stored within the index. When that field exists within an object (or objects) that is not the root object, the name of the field in the index can either include the full "path" to the field with its index_name, or just the index_name. For example (under a mapping of type person; the tweet type is removed for clarity):

{
    "person":{
        "properties":{
            "name1":{
                "type":"object",
                "path":"just_name",
                "properties":{
                    "first1":{"type":"string"},
                    "last1":{"type":"string","index_name":"i_last_1"}
                }
            },
            "name2":{
                "type":"object",
                "path":"full",
                "properties":{
                    "first2":{"type":"string"},
                    "last2":{"type":"string","index_name":"i_last_2"}
                }
            }
        }
    }
}

In the above example, the name1 and name2 objects within the person object have different combination of path and index_name. The document fields that will be stored in the index as a result of that are:

JSON Name → Document Field Name

name1/first1 → first1
name1/last1 → i_last_1
name2/first2 → name2.first2
name2/last2 → name2.i_last_2

Note, when querying or using a field name in any of the APIs provided (search, query, selective loading, …), there is an automatic detection from logical full path and into the index_name and vice versa. For example, even though name1/last1 defines that it is stored with just_name and a different index_name, it can either be referred to using name1.last1 (logical name), or its actual indexed name of i_last_1.

Moreover, where applicable, for example in queries, the full path including the type can be used, such as person.name.last1; in this case, the actual indexed name will be resolved to match against the index, and an automatic query filter will be added to only match person types.

4)root object type

The root object mapping is an object type mapping that maps the root object (the type itself). On top of all the different mappings that can be set using the object type mapping, it allows for additional, type level mapping definitions.

The root object mapping allows to index a JSON document that either starts with the actual mapping type, or only contains its fields. For example, the following tweet JSON can be indexed:

{
    "message":"This is a tweet!"
}

But, also the following JSON can be indexed:

{
    "tweet":{
        "message":"This is a tweet!"
    }
}

Out of the two, it is preferable to use the document without the type explicitly set.

index / search analyzers

The root object allows to define type mapping level analyzers for index and search that will be used with all different fields that do not explicitly set analyzers on their own. Here is an example:

{
    "tweet":{
        "index_analyzer":"standard",
        "search_analyzer":"standard"
    }
}

The above simply explicitly defines both the index_analyzer and search_analyzer that will be used. There is also an option to use the analyzer attribute to set both the search_analyzer and index_analyzer.

dynamic_date_formats

dynamic_date_formats (old setting called date_formats still works) is the ability to set one or more date formats that will be used to detect date fields. For example:

{
    "tweet":{
        "dynamic_date_formats":["yyyy-MM-dd","dd-MM-yyyy"],
        "properties":{
            "message":{"type":"string"}
        }
    }
}

In the above mapping, if a new JSON field of type string is detected, the date formats specified will be used in order to check if it's a date. If it passes parsing, then the field will be declared with date type, and will use the matching format as its format attribute. The date format itself is explained here.

The default formats are: dateOptionalTime (ISO) and yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z.

Note: dynamic_date_formats are used only for dynamically added date fields, not for date fields that you specify in your mapping.

date_detection

Allows to disable automatic date type detection (if a new field is introduced and matches the provided format), for example:

{
    "tweet":{
        "date_detection":false,
        "properties":{
            "message":{"type":"string"}
        }
    }
}

numeric_detection

Sometimes, even though json has support for native numeric types, numeric values are still provided as strings. In order to try and automatically detect numeric values from string, the numeric_detection can be set to true. For example:

{
    "tweet":{
        "numeric_detection":true,
        "properties":{
            "message":{"type":"string"}
        }
    }
}

dynamic_templates

Dynamic templates allow to define mapping templates that will be applied when dynamic introduction of fields / objects happens.

For example, we might want to have all fields to be stored by default, or all string fields to be stored, or have string fields to always be indexed with multi fields syntax, once analyzed and once not_analyzed. Here is a simple example:

{
    "person":{
        "dynamic_templates":[
            {
                "template_1":{
                    "match":"multi*",
                    "mapping":{
                        "type":"{dynamic_type}",
                        "index":"analyzed",
                        "fields":{
                            "org":{"type":"{dynamic_type}","index":"not_analyzed"}
                        }
                    }
                }
            },
            {
                "template_2":{
                    "match":"*",
                    "match_mapping_type":"string",
                    "mapping":{
                        "type":"string",
                        "index":"not_analyzed"
                    }
                }
            }
        ]
    }
}

The above mapping will create a field with multi fields for all field names starting with multi, and will map all string types to be not_analyzed.

Dynamic templates are named to allow for simple merge behavior. A new mapping, just with a new template can be "put" and that template will be added, or if it has the same name, the template will be replaced.

The match option allows defining matching on the field name. An unmatch option is also available to exclude fields if they do match on match. The match_mapping_type controls if this template will be applied only for dynamic fields of the specified type (as guessed by the JSON format).

Another option is to use path_match, which allows to match the dynamic template against the "full" dot notation name of the field (for example obj1.*.value or obj1.obj2.*), with the respective path_unmatch.

The format of all the matching is simple format, allowing to use * as a matching element supporting simple patterns such as xxx*, *xxx, xxx*yyy (with arbitrary number of pattern types), as well as direct equality. The match_pattern can be set to regex to allow for regular expression based matching.

The mapping element provides the actual mapping definition. The {name} keyword can be used and will be replaced with the actual dynamic field name being introduced. The {dynamic_type} (or {dynamicType}) keyword can be used and will be replaced with the mapping derived based on the field type (or the derived type, like date).

Complete generic settings can also be applied, for example, to have all mappings be stored, just set:

{
    "person":{
        "dynamic_templates":[
            {
                "store_generic":{
                    "match":"*",
                    "mapping":{
                        "store":true
                    }
                }
            }
        ]
    }
}

Such generic templates should be placed at the end of the dynamic_templates list because when two or more dynamic templates match a field, only the first matching one from the list is used.

5)nested type

Nested objects/documents allow to map certain sections in the document indexed as nested allowing to query them as if they are separate docs joining with the parent owning doc.

One of the problems when indexing inner objects that occur several times in a doc is that "cross object" search match will occur, for example:

{
    "obj1":[
        {
            "name":"blue",
            "count":4
        },
        {
            "name":"green",
            "count":6
        }
    ]
}

Searching for name set to blue and count higher than 5 will match the doc, because in the first element the name matches blue, and in the second element, count matches "higher than 5".

Nested mapping allows mapping certain inner objects (usually multi instance ones), for example:

{
    "type1":{
        "properties":{
            "obj1":{
                "type":"nested",
                "properties":{
                    "name":{"type":"string","index":"not_analyzed"},
                    "count":{"type":"integer"}
                }
            }
        }
    }
}

The above will cause all obj1 to be indexed as a nested doc. The mapping is similar in nature to setting type to object, except that it’s nested. Nested object fields can be defined explicitly as in the example above or added dynamically in the same way as for the root object.

Note: changing an object type to nested type requires reindexing.

The nested object fields can also be automatically added to the immediate parent by setting include_in_parent to true, and also included in the root object by setting include_in_root to true.

Nested docs will also automatically use the root doc _all field.

Searching on nested docs can be done using either the nested query or nested filter.
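For example, with the obj1 nested mapping above, the cross-object false match described earlier can be avoided with a nested query; a minimal sketch:

{
    "query":{
        "nested":{
            "path":"obj1",
            "query":{
                "bool":{
                    "must":[
                        {"match":{"obj1.name":"blue"}},
                        {"range":{"obj1.count":{"gt":5}}}
                    ]
                }
            }
        }
    }
}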

internal implementation

Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.

Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.

Because nested docs are always masked to the parent doc, the nested docs can never be accessed outside the scope of the nested query. For example stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.

The _source field is always associated with the parent document, and because of that, field values for nested objects can be fetched via the source.

6)ip type

An ip mapping type stores IPv4 addresses in numeric form, which makes it easy to sort on them and run range queries against them (using IP values).

The following table lists all the attributes that can be used with an ip type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. In this case, store should be set to true, since if it's not indexed and not stored, there is nothing to do with it.

precision_step

The precision step (number of terms generated for each number value). Defaults to 4.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

include_in_all

Should the field be included in the _all field (if enabled). Defaults to true or to the parent object type setting.
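As a sketch (type and field names are illustrative), an ip field is declared and range-queried like this:

{
    "my_type":{
        "properties":{
            "addr":{"type":"ip"}
        }
    }
}

{
    "query":{
        "range":{
            "addr":{
                "gte":"192.168.0.0",
                "lte":"192.168.255.255"
            }
        }
    }
}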

7)geo point type

Mapper type called geo_point to support geo based points. The declaration looks as follows:

{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point"
            }
        }
    }
}

indexed fields

The geo_point mapping will index a single field with the format of lat,lon. The lat_lon option can be set to also index the .lat and .lon as numeric fields, and geohash can be set to true to also index .geohash value.

A good practice is to enable indexing lat_lon as well, since both the geo distance and bounding box filters can be executed either using in-memory checks or using the indexed lat/lon values, and which one performs better really depends on the data set. Note, though, that indexed lat/lon values only make sense when there is a single geo point value for the field, not multiple values.
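A sketch of such a mapping (the pin type follows the declaration above; enabling geohash as well is optional):

{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point",
                "lat_lon":true,
                "geohash":true
            }
        }
    }
}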

geohashes

Geohashes are a form of lat/lon encoding which divides the earth up into a grid. Each cell in this grid is represented by a geohash string. Each cell in turn can be further subdivided into smaller cells which are represented by a longer string. So the longer the geohash, the smaller (and thus more accurate) the cell is.

Because geohashes are just strings, they can be stored in an inverted index like any other string, which makes querying them very efficient.

If you enable the geohash option, a geohash “sub-field” will be indexed as, eg pin.geohash. The length of the geohash is controlled by the geohash_precision parameter, which can either be set to an absolute length (eg 12, the default) or to a distance (eg 1km).

More usefully, set the geohash_prefix option to true to not only index the geohash value, but all the enclosing cells as well. For instance, a geohash of u30 will be indexed as [u,u3,u30]. This option can be used by the Geohash Cell Filter to find geopoints within a particular cell very efficiently.
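A sketch of a prefix-enabled mapping together with a geohash_cell filter against it (the precision values are illustrative):

{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point",
                "geohash_prefix":true,
                "geohash_precision":"1km"
            }
        }
    }
}

{
    "query":{
        "filtered":{
            "filter":{
                "geohash_cell":{
                    "location":{
                        "lat":41.12,
                        "lon":-71.34
                    },
                    "precision":3,
                    "neighbors":true
                }
            }
        }
    }
}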

input structure

The above mapping defines a geo_point, which accepts different formats. The following formats are supported:

lat lon as properties

{
    "pin":{
        "location":{
            "lat":41.12,
            "lon":-71.34
        }
    }
}

lat lon as string

Format in lat,lon.

{
    "pin":{
        "location":"41.12,-71.34"
    }
}

geohash

{
    "pin":{
        "location":"drm3btev3e86"
    }
}

lat lon as array

Format in [lon, lat], note, the order of lon/lat here in order to conform with GeoJSON.

{
    "pin":{
        "location":[-71.34,41.12]
    }
}

mapping options

Option Description

lat_lon

Set to true to also index the .lat and .lon as fields. Defaults to false.

geohash

Set to true to also index the .geohash as a field. Defaults to false.

geohash_precision

Sets the geohash precision. It can be set to an absolute geohash length or a distance value (eg 1km, 1m, 1ml) defining the size of the smallest cell. Defaults to an absolute length of 12.

geohash_prefix

If this option is set to true, not only the geohash but also all its parent cells (true prefixes) will be indexed as well. The number of terms that will be indexed depends on the geohash_precision. Defaults to false. Note: this option implicitly enables geohash.

validate

Set to true to reject geo points with invalid latitude or longitude (default is false). Note: Validation only works when normalization has been disabled.

validate_lat

Set to true to reject geo points with an invalid latitude.

validate_lon

Set to true to reject geo points with an invalid longitude.

normalize

Set to true to normalize latitude and longitude (default is true).

normalize_lat

Set to true to normalize latitude.

normalize_lon

Set to true to normalize longitude.

precision_step

The precision step (number of terms generated for each number value) for .lat and .lon fields if lat_lon is set to true. Defaults to 4.

field data

By default, geo points use the array format which loads geo points into two parallel double arrays, making sure there is no precision loss. However, this can require a non-negligible amount of memory (16 bytes per document) which is why Elasticsearch also provides a field data implementation with lossy compression called compressed:

{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point",
                "fielddata":{
                    "format":"compressed",
                    "precision":"1cm"
                }
            }
        }
    }
}

This field data format comes with a precision option which allows to configure how much precision can be traded for memory. The default value is 1cm. The following table presents values of the memory savings given various precisions:

Precision

Bytes per point

Size reduction

1km

4

75%

3m

6

62.5%

1cm

8

50%

1mm

10

37.5%

Precision can be changed on a live index by using the update mapping API.
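For example, a sketch of tightening the precision of the pin mapping above on a live index (the index name my_index is illustrative):

PUT my_index/pin/_mapping
{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point",
                "fielddata":{
                    "format":"compressed",
                    "precision":"3m"
                }
            }
        }
    }
}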

usage in scripts

When using doc[geo_field_name] (in the above mapping, doc['location']), doc[...].value returns a GeoPoint, which then gives access to lat and lon (for example, doc[...].value.lat). For performance, it is better to access the lat and lon directly using doc[...].lat and doc[...].lon.

8)geo shape type

The geo_shape mapping type facilitates the indexing of and searching with arbitrary geo shapes such as rectangles and polygons. It should be used when either the data being indexed or the queries being executed contain shapes other than just points.

You can query documents using this type using geo_shape Filter or geo_shape Query.

Note, the geo_shape type uses Spatial4J and JTS, both of which are optional dependencies. Consequently you must add Spatial4J v0.3 and JTS v1.12 to your classpath in order to use this type.
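As a sketch, a geo_shape filter with an inline envelope looks like this (the location field matches the example mapping below; the coordinates are illustrative):

{
    "query":{
        "filtered":{
            "filter":{
                "geo_shape":{
                    "location":{
                        "shape":{
                            "type":"envelope",
                            "coordinates":[[-45.0,45.0],[45.0,-45.0]]
                        }
                    }
                }
            }
        }
    }
}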

mapping options

The geo_shape mapping maps geo_json geometry objects to the geo_shape type. To enable it, users must explicitly map fields to the geo_shape type.

Option Description

tree

Name of the PrefixTree implementation to be used: geohash for GeohashPrefixTree and quadtree for QuadPrefixTree. Defaults to geohash.

precision

This parameter may be used instead of tree_levels to set an appropriate value for the tree_levels parameter. The value specifies the desired precision and Elasticsearch will calculate the best tree_levels value to honor this precision. The value should be a number followed by an optional distance unit. Valid distance units include: in, inch, yd, yard, mi, miles, km, kilometers, m, meters (default), cm, centimeters, mm, millimeters.

tree_levels

Maximum number of layers to be used by the PrefixTree. This can be used to control the precision of shape representations and therefore how many terms are indexed. Defaults to the default value of the chosen PrefixTree implementation. Since this parameter requires a certain level of understanding of the underlying implementation, users may use the precision parameter instead. However, Elasticsearch only uses the tree_levels parameter internally and this is what is returned via the mapping API even if you use the precision parameter.

distance_error_pct

Used as a hint to the PrefixTree about how precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum supported value.

prefix trees

To efficiently represent shapes in the index, Shapes are converted into a series of hashes representing grid squares using implementations of a PrefixTree. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision to represent the Earth.

Multiple PrefixTree implementations are provided:

  • GeohashPrefixTree - Uses geohashes for grid squares. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. So the longer the hash, the more precise it is. Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. The maximum amount of levels in Elasticsearch is 24.
  • QuadPrefixTree - Uses a quadtree for grid squares. Similar to geohashes, quad trees interleave the bits of the latitude and longitude; the resulting hash is a bit set. A tree level in a quad tree represents 2 bits in this bit set, one for each coordinate. The maximum amount of levels for the quad trees in Elasticsearch is 50.

accuracy

Geo_shape does not provide 100% accuracy and depending on how it is configured it may return some false positives or false negatives for certain queries. To mitigate this, it is important to select an appropriate value for the tree_levels parameter and to adjust expectations accordingly. For example, a point may be near the border of a particular grid cell and may thus not match a query that only matches the cell right next to it — even though the shape is very close to the point.

example

{
    "properties":{
        "location":{
            "type":"geo_shape",
            "tree":"quadtree",
            "precision":"1m"
        }
    }
}

This mapping maps the location field to the geo_shape type using the quad_tree implementation and a precision of 1m. Elasticsearch translates this into a tree_levels setting of 26.

performance considerations

Elasticsearch uses the paths in the prefix tree as terms in the index and in queries. The higher the levels is (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters. Big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case. Generally one trades off accuracy against index size and query performance.

The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without bloating the resulting index too much relative to the input size.

input structure

The GeoJSON format is used to represent Shapes as input as follows:

{
    "location":{
        "type":"point",
        "coordinates":[45.0,-45.0]
    }
}

Note, both the type and coordinates fields are required.

The supported types are point, linestring, polygon, multipoint and multipolygon.

Note, in geojson the correct order is longitude, latitude coordinate arrays. This differs from some APIs such as e.g. Google Maps that generally use latitude, longitude.

envelope

Elasticsearch supports an envelope type which consists of coordinates for upper left and lower right points of the shape:

{
    "location":{
        "type":"envelope",
        "coordinates":[[-45.0,45.0],[45.0,-45.0]]
    }
}

polygon

A polygon is defined by a list of a list of points. The first and last points in each (outer) list must be the same (the polygon must be closed).

{
    "location":{
        "type":"polygon",
        "coordinates":[
            [[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]
        ]
    }
}

The first array represents the outer boundary of the polygon, the other arrays represent the interior shapes ("holes"):

{
    "location":{
        "type":"polygon",
        "coordinates":[
            [[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]],
            [[100.2,0.2],[100.8,0.2],[100.8,0.8],[100.2,0.8],[100.2,0.2]]
        ]
    }
}

multipolygon

A list of geojson polygons.

{
    "location":{
        "type":"multipolygon",
        "coordinates":[
            [[[102.0,2.0],[103.0,2.0],[103.0,3.0],[102.0,3.0],[102.0,2.0]]],
            [[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]],
            [[100.2,0.2],[100.8,0.2],[100.8,0.8],[100.2,0.8],[100.2,0.2]]]
        ]
    }
}

sorting and retrieving index shapes

Due to the complex input structure and index representation of shapes, it is not currently possible to sort shapes or retrieve their fields directly. The geo_shape value is only retrievable through the _source field.

9)attachment type

The attachment type allows different "attachment" content (encoded as base64) to be indexed, for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (full list can be found here).

The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under $ES_HOME/plugins location. It will be automatically detected and the attachment type will be added.

Note, the attachment type is experimental.

Using the attachment type is simple, in your mapping JSON, simply set a certain JSON element as attachment, for example:

{
    "person":{
        "properties":{
            "my_attachment":{"type":"attachment"}
        }
    }
}

In this case, the JSON to index can be:

{
    "my_attachment":"... base64 encoded attachment ..."
}

Or it is possible to use more elaborated JSON if content type or resource name need to be set explicitly:

{
    "my_attachment":{
        "_content_type":"application/pdf",
        "_name":"resource/name/of/my.pdf",
        "content":"... base64 encoded attachment ..."
    }
}

The attachment type not only indexes the content of the doc, but also automatically adds metadata about the attachment (when available). The supported metadata fields are date, title, author, and keywords. They can be queried using "dot notation", for example: my_attachment.author.
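For example, a sketch of a query against the extracted author metadata:

{
    "query":{
        "match":{
            "my_attachment.author":"john"
        }
    }
}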

Both the meta data and the actual content are simple core type mappers (string, date, …), thus, they can be controlled in the mappings. For example:

{
    "person":{
        "properties":{
            "file":{
                "type":"attachment",
                "fields":{
                    "file":{"index":"no"},
                    "date":{"store":true},
                    "author":{"analyzer":"myAnalyzer"}
                }
            }
        }
    }
}

In the above example, the actual extracted content is mapped under the field name file, and we choose not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, but there is no need to specify their type (like string or date) since it is already known.

The plugin uses Apache Tika to parse attachments, so many formats are supported, listed here.

 

 

43. Mapping management in elasticsearch (the search engine)


1. Introduction to mapping

Mapping: when creating an index, you can define field types and related attributes in advance.
Elasticsearch guesses the field mapping you want from the basic types in the JSON source data and turns the input into searchable index entries. A mapping is the set of field data types we define ourselves; it tells Elasticsearch how to index the data and whether it can be searched.

Purpose: it makes the resulting index more precise and complete.

Kinds: static mapping and dynamic mapping.

2. Built-in mapping types (i.e. data types)

string family: text and keyword
  text: analyzed (tokenized, stemmed) and inverted-indexed
  keyword: a plain string; only an exact match can find it

Numeric types: long, integer, short, byte, double, float

Date type: date

Boolean type: boolean

Binary type: binary

Complex types: object, nested

Geo types: geo-point, geo-shape

Specialized types: ip, completion

3. Common attributes
store
index
null_value
analyzer
include_in_all
format


More attributes: https://www.elastic.co/guide/...


4. Create an index (the equivalent of creating a database), create a type (table), define fields with their types, and add data

Explanation:

#Create the index (and define the field types)
PUT jobbole                         #create an index named jobbole
{
  "mappings": {                     #the mappings section defines the field types
    "job": {                        #type name (like a table)
      "properties": {               #field type definitions
        "title":{                   #title field
          "type": "text"            #text type: analyzed, with an inverted index
        },
        "salary_min":{              #salary_min field
          "type": "integer"         #integer numeric type
        },
        "city":{                    #city field
          "type": "keyword"         #keyword: plain string type
        },
        "company":{                 #company field, an embedded object
          "properties":{            #types of the inner fields
            "name":{                #name field
              "type":"text"         #text type
            },
            "company_addr":{        #company_addr field
              "type":"text"         #text type
            },
            "employee_count":{      #employee_count field
              "type":"integer"      #integer numeric type
            }
          }
        },
        "publish_date":{            #publish_date field
          "type": "date",           #date type
          "format":"yyyy-MM-dd"     #date format pattern yyyy-MM-dd
        },
        "comments":{                #comments field
          "type": "integer"         #integer numeric type
        }
      }
    }
  }
}

#Save a document (the equivalent of writing a row to a database)
PUT jobbole/job/1                       #index name/type/id
{
  "title":"python分布式爬虫开发",       #field name: field value
  "salary_min":15000,                   #field name: field value
  "city":"北京",                        #field name: field value
  "company":{                           #embedded object
    "name":"百度",                      #field name: field value
    "company_addr":"北京市软件园",      #field name: field value
    "employee_count":50                 #field name: field value
  },
  "publish_date":"2017-4-16",           #field name: field value
  "comments":15                         #field name: field value
}

Code:

#Create the index (and define the field types)
PUT jobbole
{
  "mappings": {
    "job": {
      "properties": {
        "title":{
          "type": "text"
        },
        "salary_min":{
          "type": "integer"
        },
        "city":{
          "type": "keyword"
        },
        "company":{
          "properties":{
            "name":{
              "type":"text"
            },
            "company_addr":{
              "type":"text"
            },
            "employee_count":{
              "type":"integer"
            }
          }
        },
        "publish_date":{
          "type": "date",
          "format":"yyyy-MM-dd"
        },
        "comments":{
          "type": "integer"
        }
      }
    }
  }
}

#Save a document (the equivalent of writing a row to a database)
PUT jobbole/job/1
{
  "title":"python分布式爬虫开发",
  "salary_min":15000,
  "city":"北京",
  "company":{
    "name":"百度",
    "company_addr":"北京市软件园",
    "employee_count":50
  },
  "publish_date":"2017-4-16",
  "comments":15
}

5. Fetching the mappings under an index

#Get the mappings of every type under an index
GET jobbole/_mapping
#Get the mappings of a specific type under an index
GET jobbole/_mapping/job


[Important] Once a field's type has been set when the index is created, it can no longer be changed. If a change is unavoidable, you have to create a new index (and reindex the data), so settle on the field types up front when creating the index.
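If a type really must change, the usual workaround is to create a new index with the corrected mapping and copy the data across with the _reindex API. A minimal sketch (jobbole2 is a hypothetical target index, and only the changed field is shown):

PUT jobbole2
{
  "mappings": {
    "job": {
      "properties": {
        "city": { "type": "text" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "jobbole" },
  "dest":   { "index": "jobbole2" }
}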

Elasticsearch 5.4 Mapping explained

  • Preface
  • I. Field datatypes
    • 1 string type
    • 2 text type
    • 3 keyword type
    • 4 numeric types
    • 5 Object type
    • 6 date type
    • 7 Array type
    • 8 binary type
    • 9 ip type
    • 10 range type
    • 11 nested type
    • 12 token_count type
    • 13 geo point type
  • II. Meta-Fields
    • 1 _all
    • 2 _field_names
    • 3 _id
    • 4 _index
    • 5 _meta
    • 6 _parent
    • 7 _routing
    • 8 _source
    • 9 _type
    • 10 _uid
  • III. Mapping parameters
    • 1 analyzer
    • 2 normalizer
    • 3 boost
    • 4 coerce
    • 5 copy_to
    • 6 doc_values
    • 7 dynamic
    • 8 enabled
    • 9 fielddata
    • 10 format
    • 11 ignore_above
    • 12 ignore_malformed
    • 13 include_in_all
    • 14 index
    • 15 index_options
    • 16 fields
    • 17 norms
    • 18 null_value
    • 19 position_increment_gap
    • 20 properties
    • 21 search_analyzer
    • 22 similarity
    • 23 store
    • 24 term_vector
  • IV. Dynamic Mapping
    • 1 default mapping
    • 2 Dynamic field mapping
    • 3 Dynamic templates
    • 4 Override default template

 

Preface

Note: this post is organized from a translation of the official Elasticsearch documentation. Please credit the source when reposting: http://blog.csdn.net/napoay

I. Field datatypes

1.1 string type

Since Elasticsearch 5.X the string field type is no longer supported; it has been replaced by text and keyword. Using string still works but triggers a deprecation warning.

Test:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type":  "string"
        }
      }
    }
  }
}

Result:

#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [title]
{
  "acknowledged": true,
  "shards_acknowledged": true
}

1.2 text type

text replaces string. When a field is meant for full-text search, such as the body of an email or a product description, use the text type. With text, the field content is analyzed: before the inverted index is built, the analyzer splits the string into individual terms. text fields are not used for sorting and rarely for aggregations (the terms aggregation being the exception).

A mapping that sets the full_name field to text:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_name": {
          "type":  "text"
        }
      }
    }
  }
}

1.3 keyword type

The keyword type suits structured fields such as email addresses, hostnames, status codes, and tags, and fields that need filtering (e.g. finding published posts whose status is published), sorting, or aggregations. keyword fields can only be found by exact-value search.

1.4 Numeric types

Elasticsearch supports the following numeric types:

Type Range
long -2^63 to 2^63-1
integer -2^31 to 2^31-1
short -32,768 to 32,767
byte -128 to 127
double 64-bit double-precision IEEE 754 float
float 32-bit single-precision IEEE 754 float
half_float 16-bit half-precision IEEE 754 float
scaled_float a float stored scaled by a factor (e.g. if prices only need cent precision, a price of 57.34 with a scaling factor of 100 is stored as 5734)

For float, half_float and scaled_float, -0.0 and +0.0 are distinct values: a term query for -0.0 does not match +0.0; likewise, in a range query an upper bound of -0.0 does not match +0.0 and a lower bound of +0.0 does not match -0.0.

When picking one of these types:

  1. Choose the smallest type that satisfies the requirement. For example, if a field's value never exceeds 100, byte is enough; the Guinness record for human age is 134 years, so short suffices for an age field. The shorter the field, the more efficient indexing and searching are.
  2. Prefer scaled floats (scaled_float) where applicable.

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}

1.5 Object type

JSON is naturally hierarchical, so documents can contain nested objects:

PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

In the document above, the whole thing is one JSON object; it contains a manager, which in turn contains a name. In the end the document is indexed as a flat set of key-value pairs:

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}

The mapping for the document structure above:

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { 
              "properties": {
                "first": { "type": "text" },
                "last":  { "type": "text" }
              }
            }
          }
        }
      }
    }
  }
}

1.6 date type

JSON has no date type, so in Elasticsearch a date can be any of the following:

  1. a date-formatted string, e.g. "2015-01-01" or "2015/01/01 12:10:30"
  2. a long number of milliseconds-since-the-epoch
  3. an integer number of seconds-since-the-epoch

The date format can be customized; if none is given, the default is:

"strict_date_optional_time||epoch_millis"

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type": "date" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "date": "2015-01-01" } 

PUT my_index/my_type/2
{ "date": "2015-01-01T12:10:30Z" } 

PUT my_index/my_type/3
{ "date": 1420070400001 } 

GET my_index/_search
{
  "sort": { "date": "asc"} 
}

Fetching the three date documents:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "date": "2015-01-01"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "date": 1420070400001
        }
      }
    ]
  }
}

Sorted result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": null,
        "_source": {
          "date": "2015-01-01"
        },
        "sort": [
          1420070400000
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": null,
        "_source": {
          "date": 1420070400001
        },
        "sort": [
          1420070400001
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": null,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        },
        "sort": [
          1420114230000
        ]
      }
    ]
  }
}

1.7 Array type

Elasticsearch has no dedicated array type; by default any field may hold one or more values, but all values in one array must share a single type. For example:

  1. array of strings: [ "one", "two" ]
  2. array of integers: [1, 3]
  3. nested arrays: [1, [2, 3]], equivalent to [1, 2, 3]
  4. array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 } ]

Caveats (see the sketch below):

  • when data is added dynamically, the type of the array's first value determines the type of the whole array
  • mixed-type arrays such as [1, "abc"] are not supported
  • an array may contain null values, and an empty array [ ] is treated as a missing field
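A minimal sketch (index, type and field names are illustrative): the first write makes tags a text array; the second, single-valued write is still fine, but a later mixed-type value such as a number would be rejected.

PUT my_index/my_type/1
{
  "tags": [ "elasticsearch", "search" ]
}

PUT my_index/my_type/2
{
  "tags": "lucene"
}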

1.8 binary type

The binary type accepts a base64-encoded string; by default it is neither stored nor searchable.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}

Searching the blob field:

GET my_index/_search
{
  "query": {
    "match": {
      "blob": "test" 
    }
  }
}

The result:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Binary fields do not support searching",
        "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
        "index": "my_index"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "3dQd1RRVTMiKdTckM68nPQ",
        "reason": {
          "type": "query_shard_exception",
          "reason": "Binary fields do not support searching",
          "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
          "index": "my_index"
        }
      }
    ]
  },
  "status": 400
}

A Base64 encode/decode tool: http://www1.tc711.com/tool/BASE64.htm

1.9 ip type

An ip field stores IPv4 or IPv6 addresses.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}

1.10 range type

The following range types are supported:

Type Range
integer_range -2^31 to 2^31-1
float_range 32-bit IEEE 754
long_range -2^63 to 2^63-1
double_range 64-bit IEEE 754
date_range a 64-bit integer, in milliseconds

Typical use cases for range types: a date-picker form on the front end, an age-range selector, and so on.
Example:

PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

PUT range_index/my_type/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}

The code above creates a range_index index where expected_attendees runs from 10 to 20 people and time_frame runs from 2015-10-31 12:00:00 to 2015-11-01.

Query:

POST range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-08-01",
        "lte" : "2015-12-01",
        "relation" : "within" 
      }
    }
  }
}

1.11 nested type

The nested type is a special case of object that lets each Object inside an array be indexed and queried independently. Plain Object fields can cause problems; for instance, document my_index/my_type/1 looks like this:

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

The user field is dynamically added as an Object type
and ends up flattened into this form:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

user.first and user.last are flattened into multi-value fields, and the association between Alice and White is lost. The document above then incorrectly matches the following query (it is found even though no Alice Smith actually exists):

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

Using the nested field type fixes this shortcoming of Object:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}

1.12 token_count type

A token_count field stores the number of tokens the analyzer produces for the string:


PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "length": { 
              "type":     "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "name": "John Smith" }

PUT my_index/my_type/2
{ "name": "Rachel Alice Williams" }

GET my_index/_search
{
  "query": {
    "term": {
      "name.length": 3 
    }
  }
}

1.13 geo point type

The geo-point type stores the latitude and longitude of a location:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my_index/my_type/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" 
}

PUT my_index/my_type/3
{
  "text": "Geo-point as a geohash",
  "location": "drm3btev3e86" 
}

PUT my_index/my_type/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}

GET my_index/_search
{
  "query": {
    "geo_bounding_box": { 
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}

II. Meta-Fields

2.1 _all

The _all field is a super-field that splices all other fields together, separated by spaces; it is analyzed and indexed but not stored. Use it when you want documents that contain a keyword without knowing which specific field to search.
Example:

PUT my_index/blog/1 
{
  "title":    "Master Java",
  "content":     "learn java",
  "author": "Tom"
}

The _all field contains: [ "Master", "Java", "learn", "Tom" ]

Search:

GET my_index/_search
{
  "query": {
    "match": {
      "_all": "Java"
    }
  }
}

The result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.39063013,
    "hits": [
      {
        "_index": "my_index",
        "_type": "blog",
        "_id": "1",
        "_score": 0.39063013,
        "_source": {
          "title": "Master Java",
          "content": "learn java",
          "author": "Tom"
        }
      }
    ]
  }
}

Using copy_to to build a custom _all field:

PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "title": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "content": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "full_content": {
          "type":    "text"
        }
      }
    }
  }
}

PUT myindex/mytype/1
{
  "title": "Master Java",
  "content": "learn Java"
}

GET myindex/_search
{
  "query": {
    "match": {
      "full_content": "java"
    }
  }
}

2.2 _field_names

The _field_names field stores the names of every non-null field in a document; it is commonly used with exists queries. For example:

PUT my_index/my_type/1
{
  "title": "This is a document"
}

PUT my_index/my_type/2?refresh=true
{
  "title": "This is another document",
  "body": "This document has a body"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "body" ] 
    }
  }
}

The search returns the second document, because the first document has no body field.
Equivalently, an exists query can be used:

GET my_index/_search
{
    "query": {
        "exists" : { "field" : "body" }
    }
}

2.3 _id

Every indexed document has a _type and an _id. The _id can be used in term, terms, match, query_string and simple_query_string queries, but not in aggregations, scripts, or sorting. For example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ] 
    }
  }
}

2.4 _index

When querying across multiple indices it is sometimes useful to restrict the query to particular index names; the _index field makes this convenient: index names can be used in term and terms queries, aggregations, scripts, and sorting.

_index is a virtual field that is not actually added to the Lucene index. term and terms queries work against _index (as do match, query_string and simple_query_string), but prefix, wildcard, regexp and fuzzy queries are not supported.

For example, two documents in two indices:


PUT index_1/my_type/1
{
  "text": "Document in index 1"
}

PUT index_2/my_type/2
{
  "text": "Document in index 2"
}

Querying, aggregating, and sorting on the index name, plus adding a field via a script:

GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": {
        "lang": "painless",
        "inline": "doc[''_index'']" 
      }
    }
  }
}

2.5 _meta

Omitted.

2.6 _parent

_parent establishes a parent-child relationship between documents in the same index. In the example below, the mapping first declares the relationship; then a parent document is indexed, child documents are indexed with the parent id, and finally the parent is queried through its children.

PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}


PUT my_index/my_parent/1 
{
  "text": "This is a parent document"
}

PUT my_index/my_child/2?parent=1 
{
  "text": "This is a child document"
}

PUT my_index/my_child/3?parent=1&refresh=true 
{
  "text": "This is another child document"
}


GET my_index/my_parent/_search
{
  "query": {
    "has_child": { 
      "type": "my_child",
      "query": {
        "match": {
          "text": "child document"
        }
      }
    }
  }
}

2.7 _routing

The routing parameter. Elasticsearch computes which shard a document lands on with the formula:

shard_num = hash(_routing) % num_primary_shards

The default _routing value is the document's _id or its _parent; the routing parameter sets a custom route. For example, to store all blogs published by user1 on the same shard, specify the routing parameter at index time and query on that route:

PUT my_index/my_type/1?routing=user1&refresh=true 
{
  "title": "This is a document"
}

GET my_index/my_type/1?routing=user1

Querying with the routing parameter at search time:

GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  }
}

GET my_index/_search?routing=user1,user2 
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}

Making routing mandatory in the mapping:

PUT my_index2
{
  "mappings": {
    "my_type": {
      "_routing": {
        "required": true 
      }
    }
  }
}

PUT my_index2/my_type/1 
{
  "text": "No routing value provided"
}

2.8 _source

Holds the original document as it was indexed. The _source field is enabled by default; it can also be disabled:

PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}

In general, though, don't disable it, unless you are sure you will never need to:

  • use update, update_by_query, or reindex
  • use highlighting
  • back up data, change mappings, or upgrade an index
  • debug queries or aggregations with the original fields

2.9 _type

Every indexed document has a _type and an _id. _type can be used for queries, aggregations, scripts and sorting. For example:

PUT my_index/type_1/1
{
  "text": "Document with type 1"
}

PUT my_index/type_2/2?refresh=true
{
  "text": "Document with type 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": {
        "lang": "painless",
        "inline": "doc[''_type'']" 
      }
    }
  }
}

2.10 _uid

_uid is the combination of _type and _id. Like _type, it can be used for queries, aggregations, scripts and sorting. For example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2?refresh=true
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_uid": [ "my_type#1", "my_type#2" ] 
    }
  },
  "aggs": {
    "UIDs": {
      "terms": {
        "field": "_uid", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_uid": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "UID": {
      "script": {
         "lang": "painless",
         "inline": "doc[''_uid'']" 
      }
    }
  }
}

III. Mapping parameters

3.1 analyzer

Specifies the analyzer (tokenizer plus filters), effective at both index and query time. For example, configuring the ik analyzer:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

3.2 normalizer

A normalizer applies standardization before parsing, e.g. converting all characters to lowercase. Example:

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

After passing through the normalizer, BÀR becomes bar, so documents 1 and 2 are both found.

3.3 boost

boost sets a field's weight. For example, to make a keyword hit in title count twice as much as one in content, set the mapping as follows (content keeps the default weight of 1):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "boost": 2 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

The same weight can equally be specified at query time:

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

Boosting at query time is recommended: a boost baked into the mapping cannot be changed without reindexing the documents, while a query-time boost achieves the same effect.

3.4 coerce

The coerce attribute cleans up dirty data and defaults to true. The integer 5 might arrive as the string "5" or the float 5.0; coercion handles this:

  • strings are coerced to integers
  • floats are truncated to integers

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "number_one": "10" 
}

PUT my_index/my_type/2
{
  "number_two": "10" 
}

The mapping declares number_one as integer; even though the inserted value is a string, the insert succeeds. number_two has coerce disabled, so its insert fails.

3.5 copy_to

copy_to builds a custom _all-like field; in other words, several fields can be merged into one super-field. For example, first_name and last_name can be merged into full_name.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

3.6 doc_values

doc_values speed up sorting and aggregations: alongside the inverted index, an extra column-oriented store is written, trading space for time. They are enabled by default and can be disabled for fields that will definitely never be sorted or aggregated on.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": { 
          "type":       "keyword"
        },
        "session_id": { 
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}

Note: the text type does not support doc_values.

3.7 dynamic

The dynamic attribute controls the handling of newly detected fields and takes three values:

  • true: newly detected fields are added to the mapping (default)
  • false: newly detected fields are ignored; new fields must be added explicitly
  • strict: a newly detected field raises an exception and the document is rejected

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic": false, 
      "properties": {
        "user": { 
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { 
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

PS: strict is not a boolean value, so it must be quoted.

3.8 enabled

Elasticsearch indexes all fields by default. For a field with enabled set to false, ES skips the field content: it can still be fetched from _source but is not searchable, and its value may take any shape.

PUT my_index
{
  "mappings": {
    "session": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { 
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/session/session_1
{
  "user_id": "kimchy",
  "session_data": { 
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

PUT my_index/session/session_2
{
  "user_id": "jpountz",
  "session_data": "none", 
  "last_updated": "2015-12-06T18:22:13"
}

3.9 fielddata

Search answers the question "which documents contain this query term?"; aggregation is the inverse, answering "which terms does this document contain?". Most fields generate doc_values at index time, but the text type does not support doc_values.

Instead, text fields build a query-time data structure called fielddata. Fielddata is generated the first time a field is aggregated or sorted on, or used in a script: Elasticsearch re-derives the document-to-term relation by reading the inverted index off disk, then sorts the result in the Java heap.

fielddata is disabled on text fields by default, and enabling it consumes a lot of memory. Before enabling it, think hard about why you need to aggregate or sort on a text field; in most cases it makes no sense.

"New York" is analyzed into "new" and "york", so aggregating on the text field yields separate "new" and "york" buckets when what you probably want is a single "New York". The fix is to add an unanalyzed keyword sub-field:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

With the mapping above, my_field serves full-text search while my_field.keyword serves aggregation, sorting, and scripts.

3.10 format

The format attribute is mainly used to format dates:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}

More built-in date formats: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html

3.11 ignore_above

ignore_above caps the length of values that get indexed and stored; values longer than the limit are ignored:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 15
        }
      }
    }
  }
}

PUT my_index/my_type/1 
{
  "message": "Syntax error"
}

PUT my_index/my_type/2 
{
  "message": "Syntax error with some long stacktrace"
}

GET my_index/_search 
{
  "size": 0, 
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}

The mapping caps the field at 15 characters via ignore_above. The first document's value is shorter than 15, so it is indexed; the second exceeds 15 and is not, so only "Syntax error" shows up, as the result confirms:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "messages": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

3.12 ignore_malformed

ignore_malformed tolerates irregular data. A login field, say, might receive a date from some clients and an email string from others. Indexing the wrong data type into a field normally raises an exception and fails the whole document. With ignore_malformed set to true, the exception is swallowed: the malformed field is not indexed, but the document's other fields are indexed normally.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

PUT my_index/my_type/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}

In the example above, number_one accepts integers and has ignore_malformed set to true, so document 1 is written successfully even though number_one holds a string; number_two also accepts integers but leaves ignore_malformed at its default of false, so the write fails.

3.13 include_in_all

include_in_all controls whether a field is included in the _all field. It defaults to on, except when index is set to no.
In the example below, title and content are included in _all while date is not.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { 
          "type": "text"
        },
        "content": { 
          "type": "text"
        },
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}

include_in_all can also be set per object: below, all fields of my_type are excluded from _all, while author.first_name and author.last_name are included:

PUT my_index
{
  "mappings": {
    "my_type": {
      "include_in_all": false, 
      "properties": {
        "title":          { "type": "text" },
        "author": {
          "include_in_all": true, 
          "properties": {
            "first_name": { "type": "text" },
            "last_name":  { "type": "text" }
          }
        },
        "editor": {
          "properties": {
            "first_name": { "type": "text" }, 
            "last_name":  { "type": "text", "include_in_all": true } 
          }
        }
      }
    }
  }
}

3.14 index

The index attribute controls whether a field is indexed; a field that is not indexed cannot be searched. It takes true or false.
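A short sketch (the field name is illustrative): the field below stays retrievable from _source, but a term query against it is rejected because it is not indexed.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "session_token": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}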

3.15 index_options

index_options controls what is written to the inverted index at index time, and accepts:

Value Effect
docs only doc numbers are stored
freqs doc numbers and term frequencies are stored
positions doc numbers, term frequencies and term positions are stored; positions can be used for proximity and phrase queries
offsets doc numbers, term frequencies, term positions, and start/end character offsets are all stored; offsets enable the postings highlighter
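A sketch (field names are illustrative): a tag field that only needs yes/no matching can store just doc numbers, while a body field used for phrase queries and the postings highlighter stores offsets.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "text",
          "index_options": "docs"
        },
        "body": {
          "type": "text",
          "index_options": "offsets"
        }
      }
    }
  }
}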

3.16 fields

fields lets the same text be indexed in several different ways: a string field, for instance, can use text for full-text search and keyword for aggregations and sorting.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "city": "New York"
}

PUT my_index/my_type/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

3.17 norms

The norms parameter stores normalization factors used to score documents for relevance at query time. Norms help scoring but cost a fair amount of disk; if a field never contributes to scoring, it is best not to enable them.
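A sketch of disabling norms on a field that is filtered on but never scored (the field name is illustrative):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status": {
          "type": "text",
          "norms": false
        }
      }
    }
  }
}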

3.18 null_value

A field whose value is null is neither indexed nor searchable; the null_value parameter substitutes an explicit, indexable, searchable value for null. Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "status_code": null
}

PUT my_index/my_type/2
{
  "status_code": [] 
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL" 
    }
  }
}

Document 1 is found because its status_code was null; document 2 is not, because an empty array is not null.

3.19 position_increment_gap

To support proximity and phrase queries, text fields record token positions when analyzed. Consider a field whose value is an array:

 "names": [ "John Abraham", "Lincoln Smith"]

To keep the first array value distinct from the second, Abraham and Lincoln are separated by a position gap in the index, 100 by default. As a result, the phrase query "Abraham Lincoln" below finds nothing:

PUT my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln" 
            }
        }
    }
}

Specifying a slop greater than 100 makes it match:

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 101 
            }
        }
    }
}

The gap can be configured in the mapping with the position_increment_gap parameter:

PUT my_index
{
  "mappings": {
    "groups": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0 
        }
      }
    }
  }
}

3.20 properties

Object and nested types contain inner fields, possibly themselves nested, which are declared through the properties parameter.

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { 
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

A matching document:

PUT my_index/my_type/1 
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

manager.name and manager.age can then be searched, aggregated on, and so forth:

GET my_index/_search
{
  "query": {
    "match": {
      "manager.name": "Alice White" 
    }
  },
  "aggs": {
    "Employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "Employee Ages": {
          "histogram": {
            "field": "employees.age", 
            "interval": 5
          }
        }
      }
    }
  }
}

3.21 search_analyzer

Usually the same analyzer should be applied at index time and search time, so that the query parses into the same terms that were indexed. Sometimes, however, a different search analyzer makes sense, such as autocomplete built with the edge_ngram filter.

By default queries use the analyzer set by the analyzer attribute, but this can be overridden with search_analyzer. Example:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", 
          "search_analyzer": "standard" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick Brown Fox" 
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br", 
        "operator": "and"
      }
    }
  }
}

3.22 similarity

The similarity parameter selects the document scoring model and takes three values:

  • BM25: the default scoring model in ES and Lucene
  • classic: TF/IDF scoring
  • boolean: boolean-model scoring
    Example:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "default_field": { 
          "type": "text"
        },
        "classic_field": {
          "type": "text",
          "similarity": "classic" 
        },
        "boolean_sim_field": {
          "type": "text",
          "similarity": "boolean" 
        }
      }
    }
  }
}

default_field uses the BM25 model automatically, classic_field uses the classic TF/IDF scoring model, and boolean_sim_field uses the boolean scoring model.

3.23 store

By default, fields are indexed and searchable but not stored, which is usually fine because _source keeps a copy of the original document. In some cases, though, store makes sense: with a document holding a title, a date, and a huge content field, you can fetch just title and date like this:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

GET my_index/_search
{
  "stored_fields": [ "title", "date" ] 
}

The result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "fields": {
          "date": [
            "2015-01-01T00:00:00.000Z"
          ],
          "title": [
            "Some short title"
          ]
        }
      }
    ]
  }
}

Stored fields always come back as arrays; to get the raw original values, read them from _source.

3.24 term_vector

A term vector captures the following information about analyzed text:

  • the set of terms
  • the position of each term
  • the mapping from each term's start character back to its position in the original document

term_vector takes the following values:

Value Meaning
no default; no term vectors are stored
yes only the set of terms is stored
with_positions terms and term positions are stored
with_offsets terms and character offsets are stored
with_positions_offsets terms, term positions, and character offsets are stored

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type":        "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}

IV. Dynamic Mapping

4.1 default mapping

When a _default_ mapping is used, the other types automatically inherit its settings.

PUT my_index
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    },
    "user": {}, 
    "blogpost": { 
      "_all": {
        "enabled": true
      }
    }
  }
}

In the mapping above, _default_ disables the _all field; user inherits the _default_ configuration, so its _all field is disabled too, while blogpost re-enables _all, overriding the default.

After _default_ is updated, it only affects documents added afterwards.

4.2 Dynamic field mapping

When a document containing a previously unseen field is added to Elasticsearch, a new field is automatically added to the type mapping. The dynamic attribute controls this: false ignores the new field and strict throws an exception. When dynamic is true, Elasticsearch infers the type from the field value and maps it accordingly:

JSON value Inferred field type
null no field is added
true or false boolean
floating-point number float
integer long
JSON object object
array determined by the first non-null value in the array
string possibly date (with date detection on), double or long, text, or keyword

By default, date detection matches strings in the following formats:

[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]

Example:

PUT my_index/my_type/1
{
  "create_date": "2015/09/02"
}

GET my_index/_mapping

The resulting mapping, where create_date has been detected as a date:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "create_date": {
            "type": "date",
            "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
          }
        }
      }
    }
  }
}

Disabling date detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}

PUT my_index/my_type/1 
{
  "create": "2015/09/02"
}

Checking the mapping again, create is no longer a date:

GET my_index/_mapping
Result:
{
  "my_index": {
    "mappings": {
      "my_type": {
        "date_detection": false,
        "properties": {
          "create": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Customizing the date-detection formats:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}

PUT my_index/my_type/1
{
  "create_date": "09/25/2015"
}

Enabling automatic numeric detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}

4.3 Dynamic templates

Dynamic templates choose a mapping based on the field name. The template below maps string fields with:

  "mapping": { "type": "long"}

but only fields whose names match long_* and do not match *_text:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match":   "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/my_type/1
{
  "long_num": "5", 
  "long_text": "foo" 
}

After the document is written, long_num is a long field, while long_text remains a string (text) field.

4.4 Override default template

A _default_ mapping in an index template can override the mapping configuration of all indices. Example:

PUT _template/disable_all_field
{
  "order": 0,
  "template": "*", 
  "mappings": {
    "_default_": { 
      "_all": { 
        "enabled": false
      }
    }
  }
}

Elasticsearch 5.5 Mapping 详解

Elasticsearch 5.5 Mapping 详解

  • https://blog.csdn.net/zhanghytc/article/details/80667150  讲述 kibana 操作
  • 前言
  • 一 Field datatype 字段数据类型
    • 1string 类型
    • 2 text 类型
    • 3 keyword 类型
    • 4 数字类型
    • 5 Object 类型
    • 6 date 类型
    • 7 Array 类型
    • 8 binary 类型
    • 9 ip 类型
    • 10 range 类型
    • 11 nested 类型
    • 12token_count 类型
    • 13 geo point 类型
  • 二 Meta-Fields 元数据
    • 1 _all
    • 2 _field_names
    • 3 _id
    • 4 _index
    • 4 _meta
    • 5 _parent
    • 6 _routing
    • 7 _source
    • 8 _type
    • 9 _uid
  • 三 Mapping 参数
    • 1 analyzer
    • 2 normalizer
    • 3 boost
    • 4 coerce
    • 5 copy_to
    • 6 doc_values
    • 7 dynamic
    • 8 enabled
    • 9 fielddata
    • 10 format
    • 11 ignore_above
    • 12 ignore_malformed
    • 13 include_in_all
    • 14 index
    • 15 index_options
    • 16 fields
    • 17 norms
    • 18 null_value
    • 19 position_increment_gap
    • 20 properties
    • 21 search_analyzer
    • 22 similarity
    • 23 store
    • 24 term_vector
  • 四动态 Mapping
    • 1 default mapping
    • 2 Dynamic field mapping
    • 3 Dynamic templates
    • 4 Override default template

 


前言

 


 

一、Field datatype (字段数据类型)

1.1string 类型

ELasticsearch 5.X 之后的字段类型不再支持 string,由 text 或 keyword 取代。 如果仍使用 string,会给出警告。

Test:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type":  "string"
        }
      }
    }
  }
}

Result:

#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [title]
{
  "acknowledged": true,
  "shards_acknowledged": true
}

1.2 text type

text replaces string. When a field is meant for full-text search, such as an email body or a product description, use the text type. The content of a text field is analyzed: before the inverted index is built, the analyzer splits the string into individual terms. text fields are not used for sorting and are rarely used for aggregations (the terms aggregation being an exception).

The following mapping sets the full_name field to type text:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_name": {
          "type":  "text"
        }
      }
    }
  }
}

1.3 keyword type

The keyword type is for indexing structured fields such as email addresses, hostnames, status codes, and tags. Use it for fields that need filtering (e.g. finding published blog posts whose status attribute is published), sorting, or aggregations. keyword fields can only be found by their exact value.
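
A minimal sketch (the index, type, and field names are illustrative): a keyword field matches only on the exact value, so a term query must supply the whole string:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status": { "type": "keyword" }
      }
    }
  }
}

PUT my_index/my_type/1
{ "status": "published" }

GET my_index/_search
{
  "query": {
    "term": { "status": "published" }
  }
}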

1.4 Numeric types

Elasticsearch supports the following numeric types:

Type  Range
long  -2^63 to 2^63-1
integer  -2^31 to 2^31-1
short  -32,768 to 32,767
byte  -128 to 127
double  64-bit double-precision IEEE 754 floating point
float  32-bit single-precision IEEE 754 floating point
half_float  16-bit half-precision IEEE 754 floating point
scaled_float  a floating-point number scaled by a fixed factor (e.g. a price that only needs cent precision: 57.34 with a scaling factor of 100 is stored as 5734)

For float, half_float, and scaled_float, -0.0 and +0.0 are distinct values: a term query for -0.0 will not match +0.0, and likewise in a range query an upper bound of -0.0 will not match +0.0 and a lower bound of +0.0 will not match -0.0.
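
A small illustration of the signed-zero behavior (the index and field names are hypothetical). Assuming my_float is dynamically mapped as a float, the following term query for 0.0 should find no hits, because the indexed value is -0.0:

PUT my_index/my_type/1
{ "my_float": -0.0 }

GET my_index/_search
{
  "query": {
    "term": { "my_float": 0.0 }
  }
}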

When choosing among these types, keep the following in mind:

  1. Pick the smallest type that satisfies your needs. For example, if a field's value will never exceed 100, byte is enough; the oldest verified person on record lived to 122, so for an age field short is more than sufficient. The smaller the type, the more efficient indexing and searching are.
  2. Prefer a scaled float with a scaling factor where possible.

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}

1.5 Object type

JSON is naturally hierarchical: documents can contain nested objects:

PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

In the document above, the outer JSON contains a manager object, which in turn contains a name object. Internally, the document is indexed as a flat list of key-value pairs:

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}

The mapping for this document structure looks like this:

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { 
              "properties": {
                "first": { "type": "text" },
                "last":  { "type": "text" }
              }
            }
          }
        }
      }
    }
  }
}

1.6 date type

JSON has no date type, so in Elasticsearch dates can be expressed in any of the following ways:

  1. A string containing a formatted date, e.g. "2015-01-01" or "2015/01/01 12:10:30"
  2. A long number of milliseconds since the epoch
  3. An integer number of seconds since the epoch

The date format can be customized; if none is specified, the default is:

"strict_date_optional_time||epoch_millis"

 

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type": "date" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "date": "2015-01-01" } 

PUT my_index/my_type/2
{ "date": "2015-01-01T12:10:30Z" } 

PUT my_index/my_type/3
{ "date": 1420070400001 } 

GET my_index/_search
{
  "sort": { "date": "asc"} 
}

 

Retrieving the three date values:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "date": "2015-01-01"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "date": 1420070400001
        }
      }
    ]
  }
}

Sorted result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": null,
        "_source": {
          "date": "2015-01-01"
        },
        "sort": [
          1420070400000
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": null,
        "_source": {
          "date": 1420070400001
        },
        "sort": [
          1420070400001
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": null,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        },
        "sort": [
          1420114230000
        ]
      }
    ]
  }
}

1.7 Array type

Elasticsearch has no dedicated array type. By default, any field can hold one or more values, but all values in an array must be of the same type (a short sketch follows the caveats below). For example:

  1. An array of strings: ["one", "two"]
  2. An array of integers: [1, 3]
  3. Nested arrays: [1, [2, 3]], equivalent to [1, 2, 3]
  4. An array of objects: [{ "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]

Caveats:

  • When a field is added dynamically, the type of the first value in the array determines the type of the whole field
  • Mixed-type arrays such as [1, "abc"] are not supported
  • Arrays may contain null values; an empty array [ ] is treated as a missing field
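
A minimal sketch of array handling (index and field names illustrative): the first document determines the field type, and a later array value is simply several values of that type:

PUT my_index/my_type/1
{ "tags": "search" }

PUT my_index/my_type/2
{ "tags": ["search", "analytics"] }

GET my_index/_search
{
  "query": {
    "match": { "tags": "analytics" }
  }
}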

1.8 binary type

The binary type accepts a Base64-encoded string; by default it is neither stored nor searchable.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}

Searching the blob field:

GET my_index/_search
{
  "query": {
    "match": {
      "blob": "test" 
    }
  }
}

Response:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Binary fields do not support searching",
        "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
        "index": "my_index"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "3dQd1RRVTMiKdTckM68nPQ",
        "reason": {
          "type": "query_shard_exception",
          "reason": "Binary fields do not support searching",
          "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
          "index": "my_index"
        }
      }
    ]
  },
  "status": 400
}

Base64 encoding/decoding tool: http://www1.tc711.com/tool/BASE64.htm

1.9 ip type

Fields of type ip store IPv4 or IPv6 addresses.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}

1.10 range type

The following range types are supported:

Type  Range
integer_range  -2^31 to 2^31-1
float_range  32-bit IEEE 754
long_range  -2^63 to 2^63-1
double_range  64-bit IEEE 754
date_range  a range of 64-bit integer millisecond-since-epoch timestamps

Typical uses of range types include date-picker forms and age-range filters on the front end.
Example:

PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

PUT range_index/my_type/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}

The code above creates a range_index index and a document whose expected_attendees range is 10 to 20 and whose time_frame runs from 2015-10-31 12:00:00 to 2015-11-01.

Query:

POST range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-08-01",
        "lte" : "2015-12-01",
        "relation" : "within" 
      }
    }
  }
}

Query result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "range_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "expected_attendees": {
            "gte": 10,
            "lte": 20
          },
          "time_frame": {
            "gte": "2015-10-31 12:00:00",
            "lte": "2015-11-01"
          }
        }
      }
    ]
  }
}

1.11 nested type

The nested type is a special case of object that lets arrays of objects be indexed and queried independently of one another. Using the plain Object type can cause problems. For example, consider document my_index/my_type/1:

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

 

The user field is dynamically added as an object field.
Internally the document is flattened into the following form:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

 

user.first and user.last are flattened into multi-value fields, and the association between Alice and White is lost. As a result, the document incorrectly matches the following query (it is found even though no Alice Smith exists):

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

Using the nested type fixes this shortcoming of object:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}

1.12 token_count type

A token_count field stores the number of tokens a string produces after analysis:


PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "length": { 
              "type":     "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "name": "John Smith" }

PUT my_index/my_type/2
{ "name": "Rachel Alice Williams" }

GET my_index/_search
{
  "query": {
    "term": {
      "name.length": 3 
    }
  }
}

1.13 geo_point type

The geo_point type stores latitude/longitude coordinates:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my_index/my_type/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" 
}

PUT my_index/my_type/3
{
  "text": "Geo-point as a geohash",
  "location": "drm3btev3e86" 
}

PUT my_index/my_type/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}

GET my_index/_search
{
  "query": {
    "geo_bounding_box": { 
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}

2. Meta-fields

2.1 _all

_all is a catch-all field that concatenates the values of all other fields into one big string, separated by spaces; it is analyzed and indexed but not stored. Use it when you want documents containing some keyword without specifying which field to search.
Example:

PUT my_index/blog/1 
{
  "title":    "Master Java",
  "content":     "learn java",
  "author": "Tom"
}

The _all field then contains: ["Master", "Java", "learn", "Tom"]

Search:

GET my_index/_search
{
  "query": {
    "match": {
      "_all": "Java"
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.39063013,
    "hits": [
      {
        "_index": "my_index",
        "_type": "blog",
        "_id": "1",
        "_score": 0.39063013,
        "_source": {
          "title": "Master Java",
          "content": "learn java",
          "author": "Tom"
        }
      }
    ]
  }
}

Using copy_to to build a custom _all-style field:

PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "title": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "content": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "full_content": {
          "type":    "text"
        }
      }
    }
  }
}

PUT myindex/mytype/1
{
  "title": "Master Java",
  "content": "learn Java"
}

GET myindex/_search
{
  "query": {
    "match": {
      "full_content": "java"
    }
  }
}

2.2 _field_names

The _field_names field indexes the names of every field in a document that holds a non-null value; it is what exists queries rely on. Example:

PUT my_index/my_type/1
{
  "title": "This is a document"
}

PUT my_index/my_type/2?refresh=true
{
  "title": "This is another document",
  "body": "This document has a body"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "body" ] 
    }
  }
}

The second document is returned, because the first document has no body field.
Equivalently, an exists query can be used:

GET my_index/_search
{
    "query": {
        "exists" : { "field" : "body" }
    }
}

2.3 _id

Every indexed document has a _type and an _id. The _id field can be used in term, terms, match, query_string, and simple_query_string queries, but not in aggregations, scripts, or sorting. Example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ] 
    }
  }
}

2.4 _index

When searching across multiple indices, it is sometimes useful to restrict a query to particular index names. The _index field makes this possible: you can run term queries, terms queries, aggregations, sorting, and scripts against the index name.

_index is a virtual field; it is not actually added to the Lucene index. term and terms queries on it work (as do match, query_string, and simple_query_string queries, which are rewritten to term queries), but prefix, wildcard, regexp, and fuzzy queries are not supported.

For example, two documents in two different indices:


PUT index_1/my_type/1
{
  "text": "Document in index 1"
}

PUT index_2/my_type/2
{
  "text": "Document in index 2"
}

Querying, aggregating, and sorting on the index name, plus adding a script field:

GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": {
        "lang": "painless",
        "inline": "doc[''_index'']" 
      }
    }
  }
}

 

2.5 _meta

Omitted.

2.6 _parent

_parent establishes a parent-child relationship between documents in the same index. In the example below, the relationship is declared in the mapping, a parent document is indexed, child documents are indexed with the parent id, and finally the parent is found through its children:

PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}


PUT my_index/my_parent/1 
{
  "text": "This is a parent document"
}

PUT my_index/my_child/2?parent=1 
{
  "text": "This is a child document"
}

PUT my_index/my_child/3?parent=1&refresh=true 
{
  "text": "This is another child document"
}


GET my_index/my_parent/_search
{
  "query": {
    "has_child": { 
      "type": "my_child",
      "query": {
        "match": {
          "text": "child document"
        }
      }
    }
  }
}

2.7 _routing

The routing parameter. Elasticsearch uses the following formula to decide which shard a document belongs to:

shard_num = hash(_routing) % num_primary_shards

 

The default _routing value is the document's _id (or its _parent, if any); a custom route can be set with the routing parameter. For example, to store all blog posts published by user1 on the same shard, pass routing when indexing and query within that route:

PUT my_index/my_type/1?routing=user1&refresh=true 
{
  "title": "This is a document"
}

GET my_index/my_type/1?routing=user1

Searching with the routing parameter:

GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  }
}

GET my_index/_search?routing=user1,user2 
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}

 

Making routing mandatory in the mapping:

PUT my_index2
{
  "mappings": {
    "my_type": {
      "_routing": {
        "required": true 
      }
    }
  }
}

PUT my_index2/my_type/1 
{
  "text": "No routing value provided"
}

 

2.8 _source

_source stores the original JSON of the document. It is enabled by default but can be disabled:

PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}

 

In general, though, do not disable it unless you are sure you will never need to:

  • use the update, update_by_query, or reindex APIs
  • use highlighting
  • back up data, change mappings, or upgrade an index
  • debug queries or aggregations by inspecting the original document

2.9 _type

Every indexed document has a _type and an _id. The _type field supports queries, aggregations, scripts, and sorting. Example:

PUT my_index/type_1/1
{
  "text": "Document with type 1"
}

PUT my_index/type_2/2?refresh=true
{
  "text": "Document with type 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": {
        "lang": "painless",
        "inline": "doc[''_type'']" 
      }
    }
  }
}

2.10 _uid

_uid is the combination of _type and _id, in the form type#id. Like _type, it supports queries, aggregations, scripts, and sorting. Example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2?refresh=true
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_uid": [ "my_type#1", "my_type#2" ] 
    }
  },
  "aggs": {
    "UIDs": {
      "terms": {
        "field": "_uid", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_uid": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "UID": {
      "script": {
         "lang": "painless",
         "inline": "doc[''_uid'']" 
      }
    }
  }
}

3. Mapping parameters

3.1 analyzer

Specifies the analyzer, effective for both indexing and searching. The example below configures the ik analyzer (a Chinese-language analysis plugin):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

3.2 normalizer

A normalizer pre-processes keyword values before indexing, for example lowercasing all characters. Example:

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

After normalization, BÀR becomes bar, so both document 1 and document 2 match the search.

3.3 boost

boost sets a field's weight. For example, to make a keyword match in title count twice as much as a match in content (whose default boost is 1), set the mapping as follows:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "boost": 2 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

 

The same effect can be achieved by boosting at query time:

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

Specifying boost at query time is recommended: an index-time boost is baked into the mapping and cannot be changed without reindexing, while a query-time boost achieves the same effect flexibly.

3.4 coerce

The coerce setting cleans up dirty values and defaults to true. The integer 5 might be sent as the string "5" or as the floating-point 5.0; coercion handles both cases:

  • strings are coerced to numbers
  • floating-point numbers are truncated to integers

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "number_one": "10" 
}

PUT my_index/my_type/2
{
  "number_two": "10" 
}

The mapping declares number_one as integer; even though the supplied value is a string, indexing succeeds. number_two has coerce disabled, so the same value fails to index.

3.5 copy_to

copy_to builds a custom _all-style field; in other words, several fields can be merged into one super field. For example, first_name and last_name can be combined into full_name:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

3.6 doc_values

doc_values speed up sorting and aggregations: alongside the inverted index, a column-oriented store is written at index time, trading disk space for query speed. They are enabled by default and can be disabled on fields that will definitely never be sorted or aggregated on:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": { 
          "type":       "keyword"
        },
        "session_id": { 
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}

Note: text fields do not support doc_values.

3.7 dynamic

The dynamic setting controls how newly detected fields are handled. It takes three values:

  • true: newly detected fields are added to the mapping (the default)
  • false: newly detected fields are ignored; new fields must be added explicitly
  • strict: if a new field is detected, an exception is thrown and the document is rejected

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic": false, 
      "properties": {
        "user": { 
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { 
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

PS: the value strict is not a boolean, so it must be quoted.

3.8 enabled

Elasticsearch indexes all fields by default. A field with enabled set to false is skipped entirely: its content can still be retrieved from _source but is not searchable, and it may hold a value of any type.

PUT my_index
{
  "mappings": {
    "session": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { 
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/session/session_1
{
  "user_id": "kimchy",
  "session_data": { 
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

PUT my_index/session/session_2
{
  "user_id": "jpountz",
  "session_data": "none", 
  "last_updated": "2015-12-06T18:22:13"
}

3.9 fielddata

Search answers the question "which documents contain this term?"; aggregations answer the opposite question, "which terms does this document contain?". Most fields generate doc_values at index time for the latter, but text fields do not support doc_values.

Instead, text fields rely on an in-memory structure called fielddata, built the first time the field is used in an aggregation, a sort, or a script. Elasticsearch builds it by reading the whole inverted index from disk, inverting the term-to-document relation, and holding the result in the JVM heap.

fielddata is disabled on text fields by default, because enabling it consumes a great deal of heap memory. Before you enable it, think hard about why you need to aggregate or sort on an analyzed text field; in most cases it makes little sense.
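
If you really do need it, fielddata can be switched on per field. A minimal sketch (the index and field names are illustrative):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "fielddata": true
        }
      }
    }
  }
}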

"New York" is analyzed into "new" and "york", so aggregating on the text field produces separate "new" and "york" buckets, when what you probably want is a single "New York" bucket. The fix is to add an unanalyzed keyword sub-field:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

 

With this mapping, my_field serves full-text search while my_field.keyword serves aggregations, sorting, and scripts.

3.10 format

The format parameter is mainly used to define date formats:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}

 

More built-in date formats: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html

3.11 ignore_above

ignore_above sets the maximum string length to index or store; values longer than the limit are ignored:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 15
        }
      }
    }
  }
}

PUT my_index/my_type/1 
{
  "message": "Syntax error"
}

PUT my_index/my_type/2 
{
  "message": "Syntax error with some long stacktrace"
}

GET my_index/_search 
{
  "size": 0, 
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}

 

The mapping caps message at 15 characters. The first document's value is under the limit and is indexed; the second exceeds it and is not, so the terms aggregation should return only "Syntax error". The result is as follows:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "messages": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

 

3.12 ignore_malformed

ignore_malformed ignores malformed values. A field like login might receive a date from some clients and an email address from others; normally, indexing the wrong data type into a field throws an exception and the whole document fails. With ignore_malformed set to true, the exception is swallowed: the offending field is not indexed, but the other fields are indexed normally.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

PUT my_index/my_type/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}

 

Above, number_one accepts integers and has ignore_malformed set to true, so document 1 indexes successfully even though the field holds a string; number_two also accepts integers but leaves ignore_malformed at its default of false, so document 2 fails.

3.13 include_in_all

include_in_all controls whether a field is included in the _all field. It defaults to true unless index is set to no. 
In the example below, title and content are included in _all, while date is excluded:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { 
          "type": "text"
        },
        "content": { 
          "type": "text"
        },
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}

 

include_in_all can also be set at the type or object level. Below, all fields of my_type are excluded from _all, and then author.first_name, author.last_name, and editor.last_name opt back in:

PUT my_index
{
  "mappings": {
    "my_type": {
      "include_in_all": false, 
      "properties": {
        "title":          { "type": "text" },
        "author": {
          "include_in_all": true, 
          "properties": {
            "first_name": { "type": "text" },
            "last_name":  { "type": "text" }
          }
        },
        "editor": {
          "properties": {
            "first_name": { "type": "text" }, 
            "last_name":  { "type": "text", "include_in_all": true } 
          }
        }
      }
    }
  }
}

3.14 index

The index parameter controls whether a field is indexed; an unindexed field cannot be searched. It takes true or false.
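
A minimal sketch (the field name is illustrative): a keyword field with index disabled can still be fetched from _source, but queries against it are rejected in 5.x because the field is not indexed:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "session_token": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}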

3.15 index_options

index_options controls what information is stored in the inverted index, with the following options (a short sketch follows the table):

Value  What is stored
docs  document numbers only
freqs  document numbers and term frequencies
positions  document numbers, term frequencies, and term positions; positions can be used for proximity and phrase queries
offsets  document numbers, term frequencies, term positions, and the start/end character offsets of each term; offsets enable the postings highlighter
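
A sketch of setting one of these values in a mapping (the field name is illustrative); offsets is a typical choice for fields that will be highlighted:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "index_options": "offsets"
        }
      }
    }
  }
}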

3.16 fields

fields lets the same value be indexed in multiple ways. For example, a string field can be indexed as text for full-text search and, via a sub-field, as keyword for sorting and aggregations:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "city": "New York"
}

PUT my_index/my_type/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

 

3.17 norms

norms store length-normalization factors used to compute a document's relevance at query time. They help scoring but consume noticeable disk space; if a field does not need scoring, it is best to disable norms on it.
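
A minimal sketch of disabling norms on a field that is filtered on but never scored (the field name is illustrative):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status": {
          "type": "text",
          "norms": false
        }
      }
    }
  }
}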

3.18 null_value

A field whose value is null is neither indexed nor searchable. The null_value parameter substitutes an explicit value for null so that it can be indexed and searched. Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "status_code": null
}

PUT my_index/my_type/2
{
  "status_code": [] 
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL" 
    }
  }
}

 

Document 1 can be found because its status_code is null (indexed as "NULL"); document 2 cannot, because an empty array is not null.

3.19 position_increment_gap

To support proximity and phrase queries, text fields record the position of each token when they are analyzed. Consider a field whose value is an array:

 "names": [ "John Abraham", "Lincoln Smith"]

 

To keep the entries apart, Elasticsearch inserts a position gap between Abraham and Lincoln in the index, 100 by default. That is why the following phrase query for "Abraham Lincoln" finds nothing:

PUT my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln" 
            }
        }
    }
}

 

With a slop larger than 100, the phrase matches:

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 101 
            }
        }
    }
}

The gap itself can be changed with the position_increment_gap parameter in the mapping:

PUT my_index
{
  "mappings": {
    "groups": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0 
        }
      }
    }
  }
}

3.20 properties

object and nested fields can themselves contain sub-fields, declared with the properties parameter:

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { 
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

A matching document:

PUT my_index/my_type/1 
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

 

You can then search and aggregate on manager.name, manager.age, and so on:

GET my_index/_search
{
  "query": {
    "match": {
      "manager.name": "Alice White" 
    }
  },
  "aggs": {
    "Employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "Employee Ages": {
          "histogram": {
            "field": "employees.age", 
            "interval": 5
          }
        }
      }
    }
  }
}

3.21 search_analyzer

Most of the time the same analyzer should be used at index time and at search time, so that the query terms line up with the terms in the index. Sometimes, though, a different search analyzer makes sense, for instance when using an edge_ngram filter to implement autocomplete.

By default, queries use the analyzer set by the analyzer parameter, but it can be overridden with search_analyzer. Example:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", 
          "search_analyzer": "standard" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick Brown Fox" 
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br", 
        "operator": "and"
      }
    }
  }
}

 

3.22 similarity

The similarity parameter selects the scoring model for a field. Three models are available:

  • BM25: the default scoring model in Elasticsearch and Lucene
  • classic: TF/IDF scoring
  • boolean: simple boolean scoring
    Example:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "default_field": { 
          "type": "text"
        },
        "classic_field": {
          "type": "text",
          "similarity": "classic" 
        },
        "boolean_sim_field": {
          "type": "text",
          "similarity": "boolean" 
        }
      }
    }
  }
}

default_field uses the default BM25 model, classic_field uses the classic TF/IDF model, and boolean_sim_field uses the boolean model.

3.23 store

By default, fields are indexed and searchable but not stored separately. That is usually fine, because _source keeps a copy of the whole original document. Storing individual fields pays off when, say, a document has a title, a date, and a very large content field, and you only want to retrieve title and date:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

GET my_index/_search
{
  "stored_fields": [ "title", "date" ] 
}

 

Query result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "fields": {
          "date": [
            "2015-01-01T00:00:00.000Z"
          ],
          "title": [
            "Some short title"
          ]
        }
      }
    ]
  }
}

 

Stored fields are always returned as arrays. If you want the original scalar value, fetch it from _source.

3.24 term_vector

Term vectors record the following information about an analyzed text:

  • the set of terms
  • the position of each term
  • the start and end character offsets mapping each term back to the original string

term_vector accepts the following values:

Value  Meaning
no  the default; term vectors are not stored
yes  only the terms are stored
with_positions  terms and term positions
with_offsets  terms and character offsets
with_positions_offsets  terms, term positions, and character offsets

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type":        "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}

4. Dynamic mapping

4.1 default mapping

When a _default_ mapping is defined, the other mapping types in the index automatically inherit its settings:

PUT my_index
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    },
    "user": {}, 
    "blogpost": { 
      "_all": {
        "enabled": true
      }
    }
  }
}

 

In the mapping above, _default_ disables the _all field. The user type inherits this configuration, so its _all field is disabled too, while blogpost re-enables _all, overriding the default.

Updating _default_ only takes effect for mapping types and documents added afterwards.

4.2 Dynamic field mapping

When a document containing a previously unseen field is added to Elasticsearch, the field is automatically added to the type's mapping. The dynamic setting controls this: false ignores the new field, and strict throws an exception. If dynamic is true, Elasticsearch infers the field type from the value:

JSON value  Inferred field type
null  no field is added
true or false  boolean
floating-point number  float
integer  long
JSON object  object
array  determined by the first non-null value in the array
string  possibly date (if date detection is enabled), double or long (if numeric detection passes), or text with a keyword sub-field

By default, date detection recognizes strings matching the following formats:

[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]

Example:

PUT my_index/my_type/1
{
  "create_date": "2015/09/02"
}

GET my_index/_mapping

The resulting mapping shows that create_date is mapped as a date:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "create_date": {
            "type": "date",
            "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
          }
        }
      }
    }
  }
}

Disabling date detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}

PUT my_index/my_type/1 
{
  "create": "2015/09/02"
}

Checking the mapping again, the create field is no longer a date:

GET my_index/_mapping
Response:
{
  "my_index": {
    "mappings": {
      "my_type": {
        "date_detection": false,
        "properties": {
          "create": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Customizing the date detection formats:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}

PUT my_index/my_type/1
{
  "create_date": "09/25/2015"
}

Enabling numeric detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}

4.3 Dynamic templates

Dynamic templates can set the mapping based on the field name. The template below maps string fields as:

  "mapping": { "type": "long"}

 

but only for field names matching long_* and not matching *_text:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match":   "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/my_type/1
{
  "long_num": "5", 
  "long_text": "foo" 
}

After the document is indexed, long_num is mapped as long, while long_text remains a string (text) field.

4.4 Override default template

The _default_ mapping of all indices can be overridden through an index template. Example:

PUT _template/disable_all_field
{
  "order": 0,
  "template": "*", 
  "mappings": {
    "_default_": { 
      "_all": { 
        "enabled": false
      }
    }
  }
}

Elasticsearch Mapping

1.  Mapping

Mapping is the process of defining how a document and the fields it contains are stored and indexed.

For example, mappings define:

  • which string fields should be treated as full-text fields
  • which fields contain numbers, dates, or geo-locations
  • whether the values of all fields in a document should be indexed into the catch-all _all field

1.1.  Mapping type

Each index has a mapping type that determines how its documents are indexed.

A mapping type has two parts:

Meta-fields

  Meta-fields customize a document's metadata; examples include _index, _type, _id, and _source

Fields or properties

  A mapping type contains a list of fields, also called properties

1.2.  Field datatypes

Each field has a datatype, which can be one of:

  • a simple type such as text, keyword, date, long, double, boolean, or ip
  • a type that supports the hierarchical nature of JSON, such as object or nested
  • a specialised type such as geo_point, geo_shape, or completion

1.3.  Example mapping

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "doc": {
            "properties": {
                "title":    { "type": "text" },
                "name":     { "type": "text" },
                "age":      { "type": "integer" },
                "created":  {
                    "type":   "date",
                    "format": "strict_date_optional_time||epoch_millis"
                }
            }
        }
    }
}
'

This creates an index called "my_index" and adds a mapping type called "doc" with four fields.

2.  Field datatypes

2.1.  Core types

String types

  text, keyword

Numeric types

  long, integer, short, byte, double, float, half_float, scaled_float

Date type

  date

Boolean type

  boolean

Binary type

  binary

Range types

  integer_range, float_range, long_range, double_range, date_range

2.2.  Complex types

Array types

  Arrays do not require a dedicated type

Object type

  object (a single JSON object)

Nested type

  nested (an array of JSON objects)

2.3.  Geo types

geo_point type

  geo_point stores latitude/longitude coordinates

geo_shape type

  geo_shape is for complex shapes

2.4.  Specialised datatypes

IP type

  ip (for IPv4 and IPv6 addresses)

Completion type

  completion (for autocomplete suggestions)
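
A minimal sketch of a completion field with a suggest query (the index, type, and field names are illustrative):

PUT music
{
  "mappings": {
    "song": {
      "properties": {
        "suggest": { "type": "completion" }
      }
    }
  }
}

PUT music/song/1
{ "suggest": "Nevermind" }

POST music/_search
{
  "suggest": {
    "song-suggest": {
      "prefix": "nev",
      "completion": { "field": "suggest" }
    }
  }
}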

Token count type

  token_count (counts the number of tokens in a string)

mapper-murmur3

  murmur3 (computes a hash of the value at index time and stores it in the index)
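
A sketch of a murmur3 sub-field (this requires the mapper-murmur3 plugin; the names below are illustrative). The stored hash is typically used to speed up cardinality aggregations:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "hash": { "type": "murmur3" }
          }
        }
      }
    }
  }
}

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "cardinality": { "field": "my_field.hash" }
    }
  }
}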

Percolator type

  percolator (accepts a query written in the Query DSL)

join type

  Defines a parent/child relationship between documents within the same index
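
A minimal sketch of a join field (join replaces the older _parent mechanism from Elasticsearch 6.x onward; the names below are illustrative). Child documents must be routed to the parent's shard:

PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": { "question": "answer" }
        }
      }
    }
  }
}

PUT my_index/doc/1
{ "text": "This is a question", "my_join_field": "question" }

PUT my_index/doc/2?routing=1
{
  "text": "This is an answer",
  "my_join_field": { "name": "answer", "parent": "1" }
}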

3.  Meta-fields

Every document has metadata associated with it.

3.1.  Identity meta-fields

  _index  the index the document belongs to

  _id     the document's ID

  _type   the document's mapping type

  _uid    a composite field made up of _type and _id

3.2.  Document source meta-fields

  _source  the original JSON of the document

  _size    the size of the _source field in bytes

3.3.  Indexing meta-fields

  _all    indexes the values of all other fields; disabled by default

  _field_names  indexes the names of all fields with non-null values

3.4.  Routing meta-field

  _routing  a custom routing value that controls which shard a document is placed on

3.5.  Other meta-fields

  _meta   application-specific metadata

4.  Summary

If you think of Elasticsearch as a relational database, then a mapping is the table definition, a mapping type is the storage engine, and field datatypes are the column types.
