Euclidean distance (L2): This metric is generally used in the field of computer vision (CV).
Inner product (IP): This metric is generally used in the field of natural language processing (NLP).
The metrics that are widely used for binary embeddings include:
Hamming: This metric is generally used in the field of natural language processing (NLP).
Jaccard: This metric is generally used in the field of molecular similarity search.
Tanimoto: This metric is generally used in the field of molecular similarity search.
Superstructure: This metric is generally used to search for similar superstructure of a molecule.
Substructure: This metric is generally used to search for similar substructure of a molecule.
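As a rough illustration of the metrics listed above, the sketch below computes L2, inner product, Hamming, and Jaccard values on small hand-made vectors in plain Python. The function names are made up for this example and are not a Milvus API. For bit vectors the Tanimoto coefficient equals the Jaccard similarity, so only Jaccard is shown; how an engine turns that coefficient into a distance, and the superstructure/substructure measures, are not covered here.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Inner product (IP); larger values mean more similar vectors."""
    return sum(x * y for x, y in zip(a, b))

def hamming_distance(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|, treating bit vectors as sets of 1-positions."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0

# Hand-made example data (illustrative only).
float_a, float_b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
bits_a, bits_b = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]

print(l2_distance(float_a, float_b))
print(inner_product(float_a, float_b))
print(hamming_distance(bits_a, bits_b))
print(jaccard_distance(bits_a, bits_b))
```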
【Nodes】
The query node is in charge of data query; the data node is responsible for data insertion and data persistence; and the index node mainly deals with index building and query acceleration.
Design principles of Milvus 2.0
Data flow 【logs --> log snapshot --> segment】
Both the table and the log are data. Milvus aggregates logs using a processing window from TimeTick. Based on log sequence, multiple logs are aggregated into one small file called a log snapshot. These log snapshots are then combined to form a segment, which can be used individually for load balancing.
Third-party dependencies
A Milvus cluster includes eight microservice components and three third-party dependencies: MinIO, etcd, and Pulsar.
The storage layer consists of three parts:
Meta store, log broker, and object storage
Data processing workflow
Reference
**Data insertion and data persistence:** https://milvus.io/blog/deep-dive-4-data-insertion-and-data-persistence.md
Loading data into memory (real-time query): https://milvus.io/blog/deep-dive-5-real-time-query.md#Load-data-to-query-node
1. Data insertion: You can specify a number of shards for each collection in Milvus, each shard corresponding to a virtual channel (vchannel). Any incoming insert/delete request is routed to a shard based on the hash value of the primary key.
When a segment is full (512 MB by default), a data flush is triggered automatically.
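The hash-based shard routing described above can be sketched as follows. This is a minimal illustration, not Milvus code: the helper names `shard_for` and `route` are hypothetical, and SHA-256 is only an assumed stand-in for whatever hash function Milvus actually applies to primary keys.

```python
import hashlib

def shard_for(primary_key, num_shards):
    """Map a primary key to a shard index with a stable hash (assumed hash function)."""
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def route(primary_keys, num_shards):
    """Group primary keys by the shard (vchannel) they would be written to."""
    shards = {i: [] for i in range(num_shards)}
    for pk in primary_keys:
        shards[shard_for(pk, num_shards)].append(pk)
    return shards

# Ten rows routed across a collection configured with two shards.
routed = route(range(1, 11), num_shards=2)
for shard_id, keys in routed.items():
    print(f"vchannel {shard_id}: {keys}")
```

Because the shard depends only on the primary key, an insert and a later delete for the same key always land on the same vchannel.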
**【segment】** The smallest unit in Milvus for data storage.
Indexes are built on segments.
Automatically flush segment data
If the segment is full, the data coord automatically triggers data flush.
Three types of segments
There are three types of segments with different status in Milvus: growing, sealed, and flushed segment.
growing: can accept inserted data.
sealed: no longer accepts inserted data. A sealed segment is a closed segment.
flushed: has already been written to disk. A segment can only be flushed when the allocated space in a sealed segment expires.
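The three segment states above can be modeled as a small state machine. This is a toy sketch under simplifying assumptions: the `Segment` class and its methods are hypothetical, sealing is triggered purely by a size threshold, and the data coord's actual flush scheduling is not modeled.

```python
from enum import Enum

class SegmentState(Enum):
    GROWING = "growing"   # accepts inserts
    SEALED = "sealed"     # closed, no more inserts
    FLUSHED = "flushed"   # written to disk

class Segment:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.size = 0
        self.state = SegmentState.GROWING

    def insert(self, num_bytes):
        """Add data to a growing segment; seal it once it is full."""
        if self.state is not SegmentState.GROWING:
            raise RuntimeError("only a growing segment accepts inserts")
        self.size += num_bytes
        if self.size >= self.max_bytes:
            self.state = SegmentState.SEALED

    def flush(self):
        """Mark a sealed segment as written to disk."""
        if self.state is not SegmentState.SEALED:
            raise RuntimeError("only a sealed segment can be flushed")
        self.state = SegmentState.FLUSHED

# 512 MB is the default segment size mentioned in these notes.
seg = Segment(max_bytes=512 * 1024 * 1024)
seg.insert(512 * 1024 * 1024)  # fills the segment, so it becomes sealed
seg.flush()
print(seg.state)
```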
2. Index building: The index node loads the log snapshots to index from a segment (which is in object storage) into memory, deserializes the corresponding data and metadata to build the index, serializes the index when index building completes, and writes it back to object storage.
Because vectors are high-dimensional, traditional tree-based indexes cannot index them efficiently. Cluster-based or graph-based indexes, which are more mature for this purpose, are used instead.
3. Query: The query node executes queries on the segments that have been loaded into memory.
A collection in Milvus is split into multiple segments, and the query nodes load indexes by segment.
Each query node is responsible for only two tasks: loading or releasing segments following the instructions from query coord, and conducting a search within the local segments.
There are two types of segments: growing segments (for incremental data) and sealed segments (for historical data). Query nodes subscribe to vchannels to receive recent updates (incremental data) as growing segments.
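A query node's "search within the local segments" can be pictured as a per-segment top-k search followed by a merge. The sketch below is an assumption-laden illustration: segments are plain dicts, the search is brute force rather than index-based, and the merge step is how one would naturally combine partial results, not something these notes describe.

```python
import heapq
import math

def search_segment(segment, query, k):
    """Brute-force top-k (distance, primary key) pairs within one segment."""
    scored = ((math.dist(vec, query), pk) for pk, vec in segment.items())
    return heapq.nsmallest(k, scored)

def search_local_segments(segments, query, k):
    """Search every local segment, then merge partial results into a global top-k."""
    partial = [hit for seg in segments for hit in search_segment(seg, query, k)]
    return heapq.nsmallest(k, partial)

# One sealed (historical) and one growing (incremental) segment, as toy data.
sealed = {1: (0.0, 0.0), 2: (5.0, 5.0)}
growing = {3: (1.0, 0.0), 4: (0.0, 3.0)}

hits = search_local_segments([sealed, growing], query=(0.0, 0.0), k=2)
print(hits)
```

Searching each segment for its own top-k before merging keeps the merge input small: at most k candidates per segment.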
Two types of data are loaded to a query node: streaming data from the log broker (growing segments) and historical data from object storage, also called persistent storage below (sealed segments).
In query node 1 in the image, historical data (batch data) are loaded via the allocated S1 and S3 from persistent storage. Meanwhile, query node 1 loads incremental data (streaming data, G1) by subscribing to channel 1 in the log broker.
Storage layout (logical)
Collection
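In Milvus, a collection is equivalent to a table in a relational database management system (RDBMS).
Data is stored in collections. Each collection contains multiple entities, and each entity contains multiple fields, the most important of which is the vector field.
【Grouped storage】: Store different types of data in separate collections. For example, one collection stores image feature vectors and another stores text feature vectors.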
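The collection / entity / field hierarchy and the grouped-storage idea can be sketched with plain dataclasses. Everything here is hypothetical (class names, field names, dimensions) and only mirrors the logical layout described above; it is not the Milvus client API.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    data: dict  # field name -> value; one of the fields holds the vector

@dataclass
class Collection:
    name: str
    vector_field: str   # name of the vector field in each entity
    dim: int            # expected vector dimensionality
    entities: list = field(default_factory=list)

    def insert(self, entity):
        """Append an entity after checking its vector field has the right size."""
        vec = entity.data[self.vector_field]
        if len(vec) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vec)}")
        self.entities.append(entity)

# Grouped storage: image and text vectors live in separate collections.
images = Collection(name="image_features", vector_field="embedding", dim=3)
texts = Collection(name="text_features", vector_field="embedding", dim=2)

images.insert(Entity({"id": 1, "embedding": [0.1, 0.2, 0.3]}))
texts.insert(Entity({"id": 1, "embedding": [0.7, 0.9], "lang": "en"}))
print(len(images.entities), len(texts.entities))
```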
index
D, I = index.search(xb[:5], k)  # sanity check: the inputs are the query embeddings and the number of neighbors k
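To make the shape of `D` and `I` in the sanity check above concrete, here is a pure-Python stand-in for an exact L2 index. It is not FAISS: `FlatL2Index` is a hypothetical class that only imitates the `add` / `search` call pattern, returning squared L2 distances in `D` and neighbor positions in `I`, one row per query.

```python
class FlatL2Index:
    """Brute-force exact L2 search; a stand-in for illustration, not FAISS itself."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, xb):
        """Store database vectors; their position becomes their id."""
        for v in xb:
            if len(v) != self.dim:
                raise ValueError(f"expected {self.dim} dimensions, got {len(v)}")
            self.vectors.append(list(v))

    def search(self, xq, k):
        """Return (D, I): squared distances and ids of the k nearest vectors per query."""
        D, I = [], []
        for q in xq:
            scored = sorted(
                (sum((a - b) ** 2 for a, b in zip(v, q)), i)
                for i, v in enumerate(self.vectors)
            )[:k]
            D.append([d for d, _ in scored])
            I.append([i for _, i in scored])
        return D, I

# Ten distinct 2-d database vectors (toy data).
xb = [[float(i), float(i * i % 7)] for i in range(10)]
index = FlatL2Index(dim=2)
index.add(xb)

k = 3
D, I = index.search(xb[:5], k)  # sanity check: query with the first five database vectors
print(I)
print(D)
```

Since the queries are taken from the database itself, the first neighbor of each query should be the query's own position with distance 0, which is what makes this a useful sanity check.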
API Reference
https://faiss.ai/
https://ai.meta.com/tools/faiss/
https://github.com/facebookresearch/faiss/wiki
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
Example Reference
How to Use FAISS to Build Your First Similarity Search (Medium)
Milvus