# elasticLab

**Repository Path**: pardon110/elastic-lab

## Basic Information

- **Project Name**: elasticLab
- **Description**: Elasticsearch full-text search, KNN/RRF vector search (ML), semantic search
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2024-05-06
- **Last Updated**: 2025-01-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# [Elasticsearch Search Tutorial](https://www.elastic.co/search-labs/tutorials/search-tutorial/full-text-search/search-basics)

Elasticsearch full-text search, vector search (ML), and semantic search.

| Branch | Link |
|--------|------|
| master | [Full-text search](https://gitee.com/pardon110/elastic-lab/tree/master/) |
| KNN    | [Basic vector query](https://gitee.com/pardon110/elastic-lab/tree/KNN/) |
| RRF    | [Hybrid search ranking](https://gitee.com/pardon110/elastic-lab/tree/RRF/) |

This directory contains a starter Flask project used in the Search tutorial.

## Docker

```bash
docker run -p 9200:9200 -d --name elasticsearch \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "xpack.security.http.ssl.enabled=false" \
  -e "xpack.license.self_generated.type=trial" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.0
```

## Install dependencies

- Option 1

```bash
cd search-tutorial

python3 -m venv .venv

source .venv/bin/activate
```

- Option 2

This example uses a conda environment named `estest`, with Elasticsearch deployed locally via Docker.

```bash
pip install -r requirements.txt

flask run
```

## Packages

- `dotenv`: loads configuration from a `.env` file into environment variables

## Full-Text Search

- Install the Elasticsearch client

```bash
pip install elasticsearch

pip freeze > requirements.txt
```

### Create the Index

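```python
# Drop any existing index first so the command can be re-run safely,
# then create a fresh, empty index.
self.es.indices.delete(index='my_documents', ignore_unavailable=True)
self.es.indices.create(index='my_documents')
```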
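As a minimal sketch of how these two calls are typically combined with data loading, the helper below recreates the index and then bulk-loads documents. The `Search` class name, its method names, and the `data.json` file name are assumptions for illustration. The client is passed in, so any object exposing `indices.create`, `indices.delete`, and `bulk` will do.

```python
import json


class Search:
    """Thin wrapper around an Elasticsearch client (hypothetical helper)."""

    def __init__(self, es, index='my_documents'):
        # `es` is expected to be an elasticsearch.Elasticsearch instance,
        # e.g. Elasticsearch('http://localhost:9200') for a local Docker node.
        self.es = es
        self.index = index

    def create_index(self):
        # Delete first so that re-running starts from a clean index.
        self.es.indices.delete(index=self.index, ignore_unavailable=True)
        self.es.indices.create(index=self.index)

    def insert_documents(self, documents):
        # The bulk API takes alternating action and document entries.
        operations = []
        for document in documents:
            operations.append({'index': {'_index': self.index}})
            operations.append(document)
        return self.es.bulk(operations=operations)

    def reindex(self, path='data.json'):
        # `data.json` is an assumed file holding a JSON list of documents.
        self.create_index()
        with open(path, 'rt') as f:
            documents = json.load(f)
        return self.insert_documents(documents)
```

Sending all documents in one bulk request avoids one HTTP round trip per document, which matters once the data set grows beyond a handful of entries.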

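Notes on Click:

- The `@app.cli.command()` decorator
  - Tells Flask to register a custom command-line command
  - The function's docstring is used as the `--help` text
  - The Flask CLI is built on Click
- Click is a Python library designed for building command-line tools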
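A minimal sketch of such a command follows. It assumes a Flask `app` object and an `es` helper whose `reindex()` method returns the bulk API response; both are passed in, and `register_commands` is a hypothetical name. Once registered, `flask reindex` runs the function and `flask reindex --help` prints its docstring.

```python
def register_commands(app, es):
    """Attach custom CLI commands to a Flask application."""

    @app.cli.command()
    def reindex():
        """Regenerate the Elasticsearch index."""
        # `es.reindex()` is assumed to return the bulk API response, which
        # carries an `items` list and a `took` time in milliseconds.
        response = es.reindex()
        print(f'Index with {len(response["items"])} documents created '
              f'in {response["took"]} milliseconds.')

    return reindex
```

In `app.py` this would be called once after the application object is created, for example `register_commands(app, es)`.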

### Search Basics

- [Match query](https://)
- Search API Request body
- Search API Response Body
- Practical BM25 article series
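To make the topics above concrete, here is a minimal sketch of a match query sent through the Python client and of reading the response body. The `name` field and the `my_documents` index are assumptions, and `build_match_query` and `search` are hypothetical helper names.

```python
def build_match_query(text, field='name'):
    """Return the `query` section of a Search API request body."""
    return {'match': {field: {'query': text}}}


def search(es, text, index='my_documents', size=5):
    # `es` is expected to be an elasticsearch.Elasticsearch client.
    response = es.search(index=index, query=build_match_query(text), size=size)
    # Each hit carries the stored document in `_source` and its relevance
    # in `_score`; by default hits arrive sorted by descending score.
    return [(hit['_score'], hit['_source'])
            for hit in response['hits']['hits']]
```

The score returned for a match query on a text field is computed with BM25 by default, which is what the article series listed above explains.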

## Usage

```mermaid
graph LR

A[build environment] --> B[import data]
B --> C[create index]
C --> D[visit search]
```

1. Install Docker
2. Run

    ```bash
    # Build the index and import the data first
    flask reindex

    # Then start the web application
    flask run
    ```

3. Visit `http://localhost:5001`

## Vector Search

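Generate embeddings with a pretrained model.

```bash
pip install sentence-transformers

pip freeze > requirements.transformers.txt
```

Elasticsearch can store and process vectors.

It provides two vector field types: dense vectors (`dense_vector`) and sparse vectors (`sparse_vector`).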

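- Similarity metrics
  - `dot_product`: dot-product similarity
    - Efficient, but requires the vectors to be normalized
    - Larger values mean the vectors are more similar
  - `cosine`: cosine similarity
    - Measures the angle between two vectors; values range from -1 to 1, and values closer to 1 mean greater similarity
  - `l2_norm`: L2 similarity (based on the L2 norm)
    - Euclidean (L2) distance; smaller values mean greater similarity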
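The three metrics can be illustrated with a few lines of plain Python. This is only a sketch of the underlying arithmetic on small example vectors, not of how Elasticsearch computes or rescales its scores internally.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    # Scale a vector to unit length.
    norm = math.sqrt(dot_product(a, a))
    return [x / norm for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]
print(dot_product(a, b))                        # depends on vector length
print(cosine(a, b))                             # depends only on the angle
print(l2_distance(a, b))                        # smaller means closer
print(dot_product(normalize(a), normalize(b)))  # matches the cosine value
```

The last line shows why `dot_product` expects normalized vectors: for unit-length vectors the dot product and the cosine similarity coincide, so the cheaper operation gives the same ranking.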

1. Loading the Model
2. Generating Embeddings
3. Storing Embeddings in Elasticsearch
    - `keyword` sub-fields
    - Dynamically mapped `text` fields
    - Adding a Vector Field to the Index
4. Rebuilding the index so that it includes the vector field and its associated vector index
5. Adding Embeddings to Documents
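The steps above can be sketched as follows. The `embedding` field name, the `summary` source field, and the helper names are assumptions for illustration. The model is passed in as any object with an `encode(text)` method, for example `SentenceTransformer('all-MiniLM-L6-v2')` from `sentence-transformers`.

```python
def vector_mappings():
    """Index mappings that add a `dense_vector` field named `embedding`."""
    return {'properties': {'embedding': {'type': 'dense_vector'}}}


def with_embedding(document, model, source_field='summary'):
    # Return a copy of the document with its embedding attached, ready to
    # be indexed. `source_field` is the text that gets embedded.
    return {**document, 'embedding': model.encode(document[source_field])}


def knn_search(es, model, text, index='my_documents', k=10, num_candidates=50):
    # Embed the query with the same model used for the documents, then ask
    # Elasticsearch for the k nearest neighbours of that vector.
    return es.search(index=index, knn={
        'field': 'embedding',
        'query_vector': model.encode(text),
        'k': k,
        'num_candidates': num_candidates,
    })
```

The mappings would be passed when the index is rebuilt, for example `es.indices.create(index='my_documents', mappings=vector_mappings())`. A larger `num_candidates` generally improves recall at the cost of a slower query.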

## Semantic Search

Elastic Learned Sparse EncodeR (ELSER) model

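- Semantic search: ELSER is a semantic search model. Rather than relying only on keyword matching, it searches by the intent or meaning of the text.
- Text expansion: when ELSER is applied to raw text (such as log messages), it produces a data structure of weighted tokens.
- No fine-tuning required: ELSER is an out-of-domain model, meaning it does not need to be fine-tuned on a specific user's data.
- Search ranking: ELSER is also an efficient solution for search ranking.
- Technical preview: ELSER is currently in technical preview.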

ELSER stores its output in a `sparse_vector` field, which plays the role that a `dense_vector` field plays for dense embedding models.

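1. Create the index mapping
2. Set up an ingest pipeline that uses an inference processor
3. Load the data
4. Ingest the data through the pipeline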
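A minimal sketch of these steps with the Python client is shown below. It assumes the ELSER model is already deployed under the id `.elser_model_2`. The pipeline id, the `elser_embedding` and `summary` field names, and the helper names are all hypothetical.

```python
ELSER_MODEL_ID = '.elser_model_2'      # assumed id of the deployed model
PIPELINE_ID = 'elser-ingest-pipeline'  # hypothetical pipeline name


def elser_mappings():
    # The sparse_vector field receives the token weights produced by ELSER.
    return {'properties': {'elser_embedding': {'type': 'sparse_vector'}}}


def elser_processors(input_field='summary'):
    # One inference processor: read `input_field`, write `elser_embedding`.
    return [{'inference': {
        'model_id': ELSER_MODEL_ID,
        'input_output': [{
            'input_field': input_field,
            'output_field': 'elser_embedding',
        }],
    }}]


def setup_elser_index(es, index='my_documents'):
    # Register the pipeline first so that the index setting below refers
    # to a pipeline that already exists.
    es.ingest.put_pipeline(id=PIPELINE_ID, processors=elser_processors())
    es.indices.delete(index=index, ignore_unavailable=True)
    es.indices.create(
        index=index,
        mappings=elser_mappings(),
        # Every document indexed afterwards passes through the pipeline.
        settings={'index': {'default_pipeline': PIPELINE_ID}},
    )


def text_expansion_query(text):
    # Query-time counterpart: expand the query text with the same model.
    return {'text_expansion': {'elser_embedding': {
        'model_id': ELSER_MODEL_ID,
        'model_text': text,
    }}}
```

With `default_pipeline` set on the index, loading the data needs no further changes: documents are indexed as before and the pipeline adds the `elser_embedding` field on the way in.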