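# dianping

**Repository Path**: zhangxin123456/dianping

## Basic Information

- **Project Name**: dianping
- **Description**: imooc (慕课网) course "Building a Personalized Recommendation System with Elasticsearch + Spark", by 龙虾三少
- **Primary Language**: Java
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2024-01-02
- **Last Updated**: 2024-01-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ES

[Official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/7.3/index.html)

### How ES search works
* One process node, or a group of process nodes, running independently on the network
* Externally it provides a search service over HTTP or the transport protocol (the transport protocol is deprecated)
* Internally it is a search database

### Terminology
* In version 7, Type is deprecated; index and type are merged into the index

| Relational database | ElasticSearch |
| :----: | :----:|
| Database | Index |
| Table | Type |
| Row | Document |
| Column | Field |
| Schema | Mapping |
| Index | Everything is indexed |
| SQL | Query DSL |
| Select * from table... | GET http://.. |
| Update table set | PUT http://.. |

### Index
* The counterpart of a database or table definition in search
* Also refers to building the index when documents are created

### Tokenization
* Search uses the term (token) as its most basic unit
* Tokens are produced by an analyzer
* The analyzer's output is used to build the inverted index
* Forward index (the document is the entry point) versus inverted index (the term is the entry point)

### TF-IDF scoring
* TF: term frequency, the number of times the term occurs in the document field being searched
* IDF: inverse document frequency, how often the term occurs across all documents, inverted (rarer terms score higher)
* TFNORM: normalized term frequency
* BM25: addresses the term-frequency problem through the denominator of the TF formula, so that a very high term frequency does not dominate the score

| Term ID | Term | Document frequency | Posting list (DocID;TF;<POS>)|
| :----: | :----:| :----: | :----:|
| 1 | 谷歌 (Google) | 5 | (1;1;<1>),(2;1;<1>),(3;2;<1;6>),(4;1;<1>),(5;1;<1>)|
| 2 | 地图 (map) | 5 | (1;1;<2>),(2;1;<2>),(3;1;<2>),(4;1;<2>),(5;1;<2>)|
| 3 | 之父 (father of) | 4 | (1;1;<3>),(2;1;<3>),(4;1;<3>),(5;1;<3>)|
| 4 | 跳槽 (job-hop) | 2 | (1;1;<4>),(4;1;<4>)|
| 5 | Facebook | 5 | (1;1;<5>),(2;1;<5>),(3;1;<8>),(4;1;<5>),(5;1;<8>)|
| 6 | 加盟 (join) | 3 | (2;1;<4>),(3;1;<7>),(5;1;<5>)|
| 7 | 创始人 (founder) | 1 | (3;1;<3>)|
| 8 | 拉斯 (Lars) | 2 | (3;1;<4>),(5;1;<4>)|
| 9 | 离开 (leave) | 1 | (3;1;<5>)|
| 10 | 与 (and) | 1 | (4;1;<6>)|

### Installing ES 7.3.0 and Kibana 7.3.0
* http://127.0.0.1:9200/_cat/health
* http://127.0.0.1:9200/_cat/nodes
* http://localhost:5601

### Distributed principles
* Shards, primary/replica, routing
* Load balancing and read/write separation
* 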
  Number of primary shards and replicas

### Distributed deployment
* Configuration file
```
# Cluster name
cluster.name: my-application
# Node name
node.name: node-1
# IP address to bind to
network.host: 127.0.0.1
# HTTP listening port
http.port: 9200
# Port for communication between cluster nodes
transport.tcp.port: 9300
# Allow cross-origin requests
http.cors.enabled: true
http.cors.allow-origin: "*"
# Seed hosts used to discover cluster nodes
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302"]
# Master-eligible nodes used to bootstrap the cluster the first time it starts
cluster.initial_master_nodes: ["127.0.0.1:9300", "127.0.0.1:9301", "127.0.0.1:9302"]
```

### Basic syntax
```
# Delete the index
DELETE /test

# Create the index
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

DELETE /employee

# Create an index without a predefined mapping, by indexing a document
PUT /employee/_doc/3
{
  "name": "凯杰",
  "age": 18
}

# Full update: replaces the whole document, so the age field is dropped
PUT /employee/_doc/1
{
  "name": "凯杰2"
}

# Get a document
GET /employee/_doc/1

# Partial update of the specified fields
POST /employee/_update/1
{
  "doc": {
    "name": "凯杰4"
  }
}

# Force create: fails if the document already exists.
# The gender field is created automatically if it does not exist yet.
POST /employee/_create/3
{
  "name": "凯杰2",
  "age": 30,
  "gender": "男"
}

# Delete a document
DELETE /employee/_doc/1

# Search all documents (results are paginated by default)
GET /employee/_search

# Create the index with an explicit mapping
PUT /employee
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age": { "type": "integer" }
    }
  }
}

# Search all records without conditions
GET /employee/_search
{
  "query": { "match_all": {} }
}

# Paginated search
GET /employee/_search
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 1
}

# Search with a keyword condition (goes through the analyzer)
GET /employee/_search
{
  "query": {
    "match": { "name": "凯皇" }
  }
}

# With sorting (no score is computed in this case)
GET /employee/_search
{
  "query": {
    "match": { "name": "凯皇" }
  },
  "sort": [
    { "age": { "order": "desc" } }
  ]
}

# With a filter (term is not analyzed, match is analyzed)
GET /employee/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "age": 30 } }
      ]
    }
  }
}

# With an aggregation
GET /employee/_search
{
  "query": {
    "match": { "name": "凯" }
  },
  "sort": [
    { "age": { "order": "desc" } }
  ],
  "aggs": {
    "group_by_age": {
      "terms": { "field": "age" }
    }
  }
}
```

### Advanced syntax
* The analyze process
```
standard analyzer: analysis = the tokenization process: character filter (filters characters) -> tokenizer (standard tokenization, splits the text on whitespace and punctuation) -> token filter (transforms tokens, lowercases them)
english analyzer: analysis = the tokenization process: character filter (filters special symbols plus stop words such as "the") -> tokenizer (splits the text on whitespace and punctuation) -> token filter (transforms tokens, stems them, removes plurals)
```
* Relevance query techniques
```
# Create an index by indexing a document
PUT /movie/_doc/1
{
  "name": "Eating an apple a day & keeps the doctor away"
}

# This query does not match the sentence above
GET /movie/_search
{
  "query": { "match": { "name": "eat" } }
}

# Use the analyze API to inspect the tokens
GET /movie/_analyze
{
  "field": "name",
  "text": "Eating an apple a day & keeps the doctor away"
}

DELETE /movie

# Create the index with an explicit mapping (using the english analyzer)
PUT /movie
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "english" }
    }
  }
}
```
* Types

|Type|Description|
|:---:|:---:|
|Text|String type that is analyzed and indexed|
|Keyword|String type that is not analyzed and can only be matched exactly|
|Date|Date type, can be combined with format|
|long,integer,short,double|Numeric types|
|boolean|true false|
|Array|["one","two"]|
|Object|Nested JSON|
|Ip|192.168.1.1|
|Geo_point|Geographic location|

* TMDB example
```
# Build complex queries on the large TMDB data set
PUT /movie
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" },
      "tagline": { "type": "text", "analyzer": "english" },
      "release_date": { "type": "date", "format": "8yyyy/MM/dd||yyyy/M/dd||yyyy/MM/d||yyyy/M/d" },
      "popularity": { "type": "double" },
      "overview": { "type": "text", "analyzer": "english" },
      "cast": {
        "type": "object",
        "properties": {
          "character": { "type": "text", "analyzer": "standard" },
          "name": { "type": "text", "analyzer": "standard" }
        }
      }
    }
  }
}

# match: the query text is analyzed with the field's analyzer, then looked up in the index
GET /movie/_search
{
  "query": {
    "match": { "title": "steve Zissou" }
  }
}

# term: the query text is not analyzed, it is looked up in the index as-is
GET /movie/_search
{
  "query": {
    "term": { "title": "steve" }
  }
}

# and/or logic between the analyzed terms; match uses or by default
GET /movie/_search
{
  "query": {
    "match": { "title": "basketball with cartoon aliens" }
  }
}

# Switch to and
GET /movie/_search
{
  "query": {
    "match": {
      "title": {
        "query": "basketball with cartoon aliens",
        "operator": "and"
      }
    }
  }
}

# Minimum number of matching terms (minimum_should_match requires two hits)
GET /movie/_search
{
  "query": {
    "match": {
      "title": {
        "query": "basketball Love Alien",
        "operator": "or",
        "minimum_should_match": 2
      }
    }
  }
}

# Phrase query
GET /movie/_search
{
  "query": {
    "match_phrase": { "title": "steve zissou" }
  }
}

# Multi-field query
GET /movie/_search
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title", "overview"]
    }
  }
}

# Inspect the TF/IDF scoring process
GET /movie/_search
{
  "explain": true,
  "query": {
    "match": { "title": "steve" }
  }
}

# Inspect the scoring process for a multi-field query
GET /movie/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title", "overview"]
    }
  }
}

# Tune the multi-field query (boost the title field's score 10 times)
GET /movie/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title^10", "overview"]
    }
  }
}

# tie_breaker: the other fields' scores are multiplied by 0.3 and added to the total score
GET /movie/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title^10", "overview"],
      "tie_breaker": 0.3
    }
  }
}

# bool query
# must: must be true
# must_not: must be false
# should: at least one is true; the more clauses are true, the higher the score
GET /movie/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "basketball with cartoon aliens" } },
        { "match": { "overview": "basketball with cartoon aliens" } }
      ]
    }
  }
}

# multi_match queries come in different types
# best_fields: the default scoring mode; the highest field score becomes the document's score ("best match" mode)
GET /movie/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title", "overview"],
      "type": "best_fields"
    }
  }
}

# dis_max (scores with the maximum value)
GET /movie/_search
{
  "explain": true,
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "basketball with cartoon aliens" } },
        { "match": { "overview": "basketball with cartoon aliens" } }
      ]
    }
  }
}

# Show the scoring explanation for best_fields
GET /movie/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title", "overview"],
      "type": "best_fields"
    }
  }
}

# Show the scoring explanation for best_fields with title boosted
GET /movie/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title^10", "overview"],
      "type": "best_fields"
    }
  }
}

# most_fields: adds up the scores of all matching fields of the document to get the result we want
GET /movie/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title^10", "overview^0.1"],
      "type": "most_fields"
    }
  }
}

# Show the scoring explanation for most_fields
GET /movie/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title^10", "overview"],
      "type": "most_fields"
    }
  }
}

# cross_fields: computes the total score across fields term by term
GET /movie/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "steve job",
      "fields": ["title", "overview"],
      "type": "cross_fields"
    }
  }
}

# Show the scoring explanation for cross_fields
GET /movie/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "steve job",
      "fields": ["title", "overview"],
      "type": "cross_fields"
    }
  }
}

# query_string: a convenient way to use AND, OR and NOT
GET /movie/_search
{
  "query": {
    "query_string": {
      "fields": ["title"],
      "query": "steve AND jobs"
    }
  }
}

# Filter queries
# Single-condition filter
GET /movie/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "title": "steve" }
      }
    }
  }
}

# Multi-condition filter
GET /movie/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "title": "steve" } },
        { "term": { "cast.name": "gaspard" } },
        { "range": { "release_date": { "lte": "2015/01/01" } } },
        { "range": { "popularity": { "gte": 25 } } }
      ]
    }
  },
  "sort": [
    { "popularity": { "order": "desc" } }
  ]
}

# Filter combined with match scoring
GET /movie/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "search" } }
      ],
      "filter": [
        { "term": { "title": "steve" } },
        { "term": { "cast.name": "gaspard" } },
        { "range": { "release_date": { "lte": "2015/01/01" } } },
        { "range": { "popularity": { "gte": 25 } } }
      ]
    }
  }
}
```
* Recall: if there are n correct results and the query returns m of them, recall is m/n
* Precision: if the query returns n documents and m of them are correct, precision is m/n
* The two cannot both be maximized, but the ranking can be tuned
```
# function_score adjusts how the final score is computed
# "query": the original query, which produces the old score (old_value)
# "field_value_factor.field": the field whose value is used to adjust the score
# "score_mode": "sum" adds the results of the individual functions together
# "boost_mode": "sum" then adds that total to old_value
GET /movie/_search
{
  "explain": true,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "steve job",
          "fields": ["title", "overview"],
          "operator": "or",
          "type": "most_fields"
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log2p",
            "factor": 10
          }
        },
        {
          "field_value_factor": {
            "field": "popularity",
            "modifier": "log2p",
            "factor": 5
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}
```
* IK analyzer
  * IK Analyze: 
    character filter (filters special symbols, quantifiers and stop words) -> segmentation based on a word dictionary
  * https://github.com/medcl/elasticsearch-analysis-ik
  * ik_smart: smart segmentation
  * ik_max_word: maximum (finest-grained) segmentation
  * "analyzer": the analyzer used when building the index
  * "search_analyzer": the analyzer used at query time
  * Best practice: use ik_max_word when indexing and ik_smart when querying
```
# Test the IK analyzer
GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

GET _analyze?pretty
{
  "analyzer": "standard",
  "text": "中华人民共和国国歌"
}

GET _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
```

### Building the shop index
```
PUT /shop
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "name": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "tags": { "type": "text", "analyzer": "whitespace", "fielddata": true },
      "location": { "type": "geo_point" },
      "price_per_man": { "type": "integer" },
      "category_id": { "type": "integer" },
      "category_name": { "type": "keyword" },
      "seller_id": { "type": "integer" },
      "seller_remark_score": { "type": "double" },
      "seller_disabled_flag": { "type": "integer" }
    }
  }
}
```

### Incremental and full synchronization with logstash-input-jdbc
* Install the logstash-input-jdbc plugin in Logstash
```
logstash-plugin install logstash-input-jdbc
```
* Write the configuration files jdbc.conf and jdbc.sql
* Start the Logstash synchronization
```
logstash -f jdbc.conf
```
* Computing distance based on LBS
```
# Query with a distance field, computed by haversin in an expression script
GET /shop/_search
{
  "query": {
    "match": { "name": "凯悦" }
  },
  "_source": "*",
  "script_fields": {
    "distance": {
      "script": {
        "source": "haversin(lat,lon,doc['location'].lat,doc['location'].lon)",
        "lang": "expression",
        "params": { "lat": 31.37, "lon": 127.12 }
      }
    }
  }
}

# Sort by distance
GET /shop/_search
{
  "query": {
    "match": { "name": "凯悦" }
  },
  "_source": "*",
  "script_fields": {
    "distance": {
      "script": {
        "source": "haversin(lat,lon,doc['location'].lat,doc['location'].lon)",
        "lang": "expression",
        "params": { "lat": 31.37, "lon": 127.12 }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 31.37, "lon": 127.12 },
        "order": "asc",
        "unit": "km",
        "distance_type": "arc"
      }
    }
  ]
}

# Use function_score for the ranking model (scoring with a Gaussian decay function)
GET /shop/_search
{
  "explain": true,
  "_source": "*",
  "script_fields": {
    "distance": {
      "script": {
        "source": "haversin(lat,lon,doc['location'].lat,doc['location'].lon)",
        "lang": "expression",
        "params": { "lat": 31.23916171, "lon": 121.48789949 }
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "name": { "query": "凯悦", "boost": 0.1 } } },
            { "term": { "seller_disabled_flag": 0 } }
          ]
        }
      },
      "functions": [
        {
          "gauss": {
            "location": {
              "origin": "31.23916171,121.48789949",
              "scale": "100km",
              "offset": "0km",
              "decay": 0.5
            }
          },
          "weight": 9
        },
        {
          "field_value_factor": { "field": "remark_score" },
          "weight": 0.2
        },
        {
          "field_value_factor": { "field": "seller_remark_score" },
          "weight": 0.1
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  },
  "sort": [
    { "_score": { "order": "desc" } }
  ]
}
```
* Sorting by lowest price
```
GET /shop/_search
{
  "explain": true,
  "_source": "*",
  "script_fields": {
    "distance": {
      "script": {
        "source": "haversin(lat,lon,doc['location'].lat,doc['location'].lon)",
        "lang": "expression",
        "params": { "lat": 31.23916171, "lon": 121.48789949 }
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "name": { "query": "凯悦", "boost": 0.1 } } },
            { "term": { "seller_disabled_flag": 0 } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": { "field": "price_per_man" },
          "weight": 1
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  },
  "sort": [
    { "_score": { "order": "asc" } }
  ]
}
```
* Aggregating by tags
```
GET /shop/_search
{
  "_source": "*",
  "script_fields": {
    "distance": {
      "script": {
        "source": "haversin(lat,lon,doc['location'].lat,doc['location'].lon)",
        "lang": "expression",
        "params": { "lat": 31.23916171, "lon": 121.48789949 }
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "name": { "query": "凯悦", "boost": 0.1 } } },
            { "term": { "seller_disabled_flag": 0 } },
            { "term": { "tags": "落地大窗" } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": { "field": "price_per_man" },
          "weight": 1
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  },
  "sort": [
    { "_score": { "order": "asc" } }
  ],
  "aggs": {
    "group_by_tags": {
      "terms": { "field": "tags" }
    }
  }
}
```

### Custom dictionary
* Add a dictionary to the IK analyzer, point the configuration file at it, then restart ES
* Update the index so that existing documents are re-analyzed
```
POST /shop/_update_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "name": "凯" } },
        { "term": { "name": "悦" } }
      ]
    }
  }
}
```
* Hot-reloading the dictionary
```
Point the configuration file at the dictionary