OpenSearch Query DSL#

Keywords: AWS, Amazon, OpenSearch, OS, OSS, Doc, Docs, Query, DSL

query domain-specific language (DSL) 是 OpenSearch 的核心查询语法. 用的是 JSON 格式. 本节对查询语法相关的内容进行了一个总结. 便于以后查阅.


Query and filter context#

查询语言里分两个大块, filter 和 query.


  • filter 指的是对结果进行过滤, 和 SQL 中的 where 类似.

  • 从每个 Document 的视角来看, filter 的结果只有 match 和不 match.

  • 从查询引擎的视角看, filter 是直接在内存中进行比较实现的.


  • 而 query 指的是一个 document 跟你的查询的相关性.

从每个 Document 的视角看, query 的结果是一个代表着相关性的数值. - 从查询引擎的视角看, query 是使用 Index 实现的.


Term-level and full-text queries compared#

Query 主要有两类. Term 和 Full-text search (FTS). Term 指的是那种确定性的, 决定性的. 例如 SQL 中的 WHERE A = B, WHERE A < B < C. 这些也走的是 Index. FTS 是基于倒排索引, 文本相关性算法, 分词技术, 同义词技术, 模糊搜索技术等一系列技术实现的一种针对文本的搜索.


Term-level queries#

Term level 的 query 有以下几种:

  • term: Searches for documents containing an exact term in a specific field.

  • terms: Searches for documents containing one or more terms in a specific field.

  • terms_set: Searches for documents that match a minimum number of terms in a specific field.

  • ids: Searches for documents by document ID.

  • range: Searches for documents with field values in a specific range.

  • prefix: Searches for documents containing terms that begin with a specific prefix.

  • exists: Searches for documents with any indexed value in a specific field.

  • fuzzy: Searches for documents containing terms that are similar to the search term within the maximum allowed Levenshtein distance. The Levenshtein distance measures the number of one-character changes needed to change one term to another term.

  • wildcard: Searches for documents containing terms that match a wildcard pattern.

  • regexp: Searches for documents containing terms that match a regular expression.


Full-text queries#

FTS 的 query 有以下几种:

  • intervals: Allows fine-grained control of the matching terms’ proximity and order.

  • match: The default full-text query, which can be used for fuzzy matching and phrase or proximity searches.

  • match_bool_prefix: Creates a Boolean query that matches all terms in any position, treating the last term as a prefix.

  • match_phrase:Similar to the match query but matches a whole phrase up to a configurable slop.

  • match_phrase_prefix:Similar to the match_phrase query but matches terms as a whole phrase, treating the last term as a prefix.

  • multi_match: Similar to the match query but is used on multiple fields.

  • query_string: Uses a strict syntax to specify Boolean conditions and multi-field search within a single query string.

  • simple_query_string: A simpler, less strict version of query_string query.


Compound queries#

Compound queries 指的是将多个搜索条件用逻辑运算排列组合起来.

  • bool (Boolean): Combines multiple query clauses with Boolean logic.

  • boosting: Changes the relevance score of documents without removing them from the search results. Returns documents that match a positive query, but downgrades the relevance of documents in the results that match a negative query.

  • constant_score: Wraps a query or a filter and assigns a constant score to all matching documents. This score is equal to the boost value.

  • dis_max (disjunction max): Returns documents that match one or more query clauses. If a document matches multiple query clauses, it is assigned a higher relevance score. The relevance score is calculated using the highest score from any matching clause and, optionally, the scores from the other matching clauses multiplied by the tiebreaker value.

  • function_score: Recalculates the relevance score of documents that are returned by a query using a function that you define.

  • hybrid: Combines relevance scores from multiple queries into one score for a given document.


Geographic and xy queries#

Geographic and xy queries 是 为地理坐标高度优化的查询.


Span queries#

Span 就是一个一个的 word 的意思. Span query 是对 span 在文档中的位置顺序 (官方的定义是 “span” refers to a contiguous sequence of words or tokens within a document), 以及相互的位置关系的查询. 例如一个词必须出现在一个词的后面的多少个词以内. 这种 query 常用于 Legal document 和 pattern 文件. 例如在租房 lease 中你搜索 rent 附近的数字, 认为这个数字就是租金.


Match all queries#

Match all 就相当于 SQL 中的 SELECT *.


Specialized queries#

一些特殊的 query.

  • distance_feature: Calculates document scores based on the dynamically calculated distance between the origin and a document’s date, date_nanos, or geo_point fields. This query can skip non-competitive hits.

  • more_like_this: Finds documents similar to the provided text, document, or collection of documents.

  • neural: Used for vector field search in neural search. (一种用神经网络技术优化 query, 把上下文纳入考量, 不局限于 query 的文本本身, 也会搜索相关的概念的一种搜搜技术)

  • neural_sparse: Used for vector field search in sparse neural search.

  • percolate: Finds queries (stored as documents) that match the provided document. (就是反过来输入一个文档, 返回可以 match 这篇文档的 query. 前提是你把 query body 当成 document 已经按照指定格式存到了 index 中)

  • rank_feature: Calculates scores based on the values of numeric features. This query can skip non-competitive hits.

  • script: Uses a script as a filter. (你可以编写脚本用来进行过滤. 默认的脚本语言叫做 painless, 还有其他脚本语言可选, 你可以自定义一些 if else 来定义复杂的过滤逻辑)

  • script_score: Calculates a custom score for matching documents using a script.

  • wrapper: Accepts other queries as JSON or YAML strings. (相当于不用原始的 JSON 语法, 而是将一个 query 序列化为 JSON 然后作为字符串传入. 这种 query 常用于你动态构建 query 的情况).