Card duplicates
One of the stand-out features of Arguflow is how we detect and handle duplicates. In this guide, we’ll look at how Arguflow handles duplicates to achieve maximal-marginal-relevance (MMR).
Our primary competitors like Vectara and Pinecone are flat out missing this feature with no intent to implement it. This is a huge advantage for Arguflow, as it allows us to provide a much better user experience.
What is MMR?
Maximal-marginal-relevance (MMR) is an information retrieval technique for selecting a diverse set of results by reducing redundancy among them. Arguflow uses it to keep duplicate cards out of the top level of search results.
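For reference, MMR is commonly written as the selection rule below. This is the standard formulation from the information retrieval literature, not something taken from Arguflow's code. The symbols are assumptions with respect to this guide: q is the query, R the set of candidate results, S the results already selected, and λ in [0, 1] a weight that trades relevance against diversity.

```latex
\mathrm{MMR} = \arg\max_{d_i \in R \setminus S} \Big[ \lambda \, \mathrm{Sim}_1(d_i, q) - (1 - \lambda) \max_{d_j \in S} \mathrm{Sim}_2(d_i, d_j) \Big]
```

A candidate that is nearly identical to an already selected result gets a large second term and is pushed down the ranking, which is the redundancy that duplicate handling aims to avoid.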
Why do we call duplicates "collisions" in Arguflow?
When you vectorize data with an embedding model, you convert it into a vector that exists in some n-dimensional space.
The distance between two vectors then represents the similarity between the two pieces of data. Vectors can be close enough that you can imagine them actually colliding in the n-dimensional space, which we call a "semantic collision".
In Arguflow, you can set an environment variable called DUPLICATE_DISTANCE_THRESHOLD to determine how similar two vectors have to be to be considered a duplicate. The default value is 0.95, which means that if two vectors have a cosine similarity of 0.95 or above, they are considered duplicates.
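The threshold check described above can be sketched as follows. This is a minimal, standard-library-only illustration and not Arguflow's implementation: the helper names (cosine_similarity, parse_threshold, duplicate_threshold_from_env, is_duplicate) are hypothetical, and only the DUPLICATE_DISTANCE_THRESHOLD variable name and the 0.95 default come from the text.

```rust
use std::env;

/// Default threshold quoted in the guide.
const DEFAULT_DUPLICATE_DISTANCE_THRESHOLD: f32 = 0.95;

/// Cosine similarity of two vectors.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Parses a raw threshold string, falling back to the default
/// when the value is missing or not a valid number (hypothetical helper).
fn parse_threshold(raw: Option<&str>) -> f32 {
    raw.and_then(|v| v.trim().parse::<f32>().ok())
        .unwrap_or(DEFAULT_DUPLICATE_DISTANCE_THRESHOLD)
}

/// Reads the threshold from the environment variable named in the guide.
fn duplicate_threshold_from_env() -> f32 {
    parse_threshold(env::var("DUPLICATE_DISTANCE_THRESHOLD").ok().as_deref())
}

/// Two vectors count as duplicates when their cosine similarity
/// is at or above the threshold.
fn is_duplicate(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

Calling is_duplicate(&new_embedding, &existing_embedding, duplicate_threshold_from_env()) would then answer whether a new card collides with an existing one. In practice the similarity score would more likely come back from the vector database than be computed by hand.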
How Arguflow's business logic for handling duplicates works
1. The CardCollisions model
We create the card_collisions SQL table in this migration. There are a couple of subsequent migrations that edit the table, but the core of the table is created in that migration.
The CardCollisions model is then defined in this file as shown below:
#[derive(Debug, Serialize, Deserialize, Queryable, Selectable, Insertable, Clone)]
#[diesel(table_name = card_collisions)]
pub struct CardCollisions {
    pub id: uuid::Uuid,
    pub card_id: uuid::Uuid,
    pub collision_qdrant_id: Option<uuid::Uuid>,
    pub created_at: chrono::NaiveDateTime,
    pub updated_at: chrono::NaiveDateTime,
}
Each CardCollisions row will relate to a CardMetadata row in the card_metadata table via the card_id column. The collision_qdrant_id column is used to relate the CardCollisions row to a Qdrant point in the qdrant db.
2. When a card is created, we check for collisions
When a card is created, we find the nearest existing point by calling global_unfiltered_top_match_query and store the result in first_semantic_result as follows:
let first_semantic_result = global_unfiltered_top_match_query(embedding_vector.clone())
    .await
    .map_err(|err| {
        ServiceError::BadRequest(format!(
            "Could not get semantic similarity for collision check: {}",
            err.message
        ))
    })?;
3. If the first_semantic_result is within the DUPLICATE_DISTANCE_THRESHOLD of 0.95, we create a CardCollisions row and not a Qdrant point
If the first_semantic_result has a similarity at or above the DUPLICATE_DISTANCE_THRESHOLD (0.95 by default), we create only a CardCollisions row and not a Qdrant point. We don't want a Qdrant point for a duplicate card, because it would worsen search-time performance and prevent MMR from working.
The implementation of this logic can be found in card_handler.rs here.
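The branching described in this step can be sketched as below. This is a simplified illustration under stated assumptions, not the code in card_handler.rs: TopMatch, CardStorage, and decide_storage are hypothetical names, and a plain u128 stands in for the uuid::Uuid point id so the example needs only the standard library.

```rust
/// Hypothetical stand-in for the nearest-neighbour result of the collision check.
struct TopMatch {
    /// Id of the nearest existing Qdrant point (u128 here instead of uuid::Uuid).
    point_id: u128,
    /// Similarity score between the new card and that point.
    score: f32,
}

/// Hypothetical outcome of the collision check.
#[derive(Debug, PartialEq)]
enum CardStorage {
    /// The nearest point is a duplicate: store only a collision row that points at it.
    CollisionOnly { collision_qdrant_id: u128 },
    /// No duplicate was found: create a new Qdrant point for the card.
    NewQdrantPoint,
}

/// Decides how a new card is stored, given the top match (if any) and the threshold.
fn decide_storage(top_match: Option<&TopMatch>, threshold: f32) -> CardStorage {
    match top_match {
        // At or above the threshold counts as a duplicate, matching "0.95 or above".
        Some(m) if m.score >= threshold => CardStorage::CollisionOnly {
            collision_qdrant_id: m.point_id,
        },
        // Either the collection is empty or the nearest point is not similar enough.
        _ => CardStorage::NewQdrantPoint,
    }
}
```

Keeping the decision in a small pure function like this makes the threshold behaviour easy to test separately from the database and Qdrant calls.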
The result in our default search UI
If you search for "Enron’s liquidity is very solid and there is no need for any panic or alarm" on our Enron demo, you will see arrows on the sides of the resulting cards as shown below:
These arrows allow you to navigate between cards that are duplicates of each other.
Thanks for reading! If you have any questions, please reach out to us on Twitter, Matrix, or Discord.