Card duplicates

One of the stand-out features of Arguflow is how we detect and handle duplicates. In this guide, we’ll look at how Arguflow handles duplicates to achieve maximal-marginal-relevance (MMR).

Our primary competitors like Vectara and Pinecone are flat out missing this feature with no intent to implement it. This is a huge advantage for Arguflow, as it allows us to provide a much better user experience.

What is MMR?

Maximal-marginal-relevance (MMR) is an information retrieval technique for selecting a diverse set of results and avoiding redundancy. Arguflow uses it to keep duplicate cards from appearing as separate top-level search results.
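For reference, the commonly cited formulation of MMR picks each next result greedily by trading relevance to the query against similarity to results already chosen. The symbols below come from that general formulation and are not taken from Arguflow's code: R is the candidate set, S the results selected so far, q the query, Sim_1 and Sim_2 are similarity measures, and λ (between 0 and 1) weights relevance against diversity.

```latex
\mathrm{MMR} = \arg\max_{d_i \in R \setminus S}
  \Big[ \lambda \, \mathrm{Sim}_1(d_i, q)
        - (1 - \lambda) \max_{d_j \in S} \mathrm{Sim}_2(d_i, d_j) \Big]
```

A near-duplicate of an already selected result has Sim_2 close to 1, so it is heavily penalized. As the rest of this guide describes, Arguflow gets a similar effect by grouping duplicates at insertion time rather than re-ranking at query time.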

Why do we call duplicates "collisions" in Arguflow?

When you vectorize data with an embedding model, you convert it into a vector in some n-dimensional space.

The distance between two vectors then represents the similarity between the two pieces of data. Two vectors can be so close that you can picture them colliding in that n-dimensional space, which is what we call a "semantic collision".

In Arguflow, you can set an environment variable called DUPLICATE_DISTANCE_THRESHOLD to determine how similar two vectors have to be to be considered a duplicate. The default value is 0.95, which means that if two vectors have a cosine similarity of 0.95 or above, they are considered duplicates.
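To make the threshold concrete, here is a minimal sketch of a cosine-similarity duplicate check. Only the DUPLICATE_DISTANCE_THRESHOLD variable name and the 0.95 default come from this guide; the function names (cosine_similarity, parse_threshold, duplicate_threshold, is_duplicate) are hypothetical, and Arguflow's actual code may read and apply the value differently.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 if either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Hypothetical helper: parse a raw setting, falling back to the
/// documented default of 0.95 when it is missing or not a number.
fn parse_threshold(raw: Option<&str>) -> f32 {
    raw.and_then(|s| s.trim().parse::<f32>().ok())
        .unwrap_or(0.95)
}

/// Hypothetical helper: read the threshold from the environment.
fn duplicate_threshold() -> f32 {
    let raw = std::env::var("DUPLICATE_DISTANCE_THRESHOLD").ok();
    parse_threshold(raw.as_deref())
}

/// Two vectors count as duplicates when their cosine similarity
/// is at or above the threshold.
fn is_duplicate(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b) >= threshold
}
```

Calling is_duplicate(&card_a, &card_b, duplicate_threshold()) would then apply whatever value is configured. Note that despite the word "distance" in the variable name, the comparison described here is on similarity, so higher values mean a stricter duplicate test.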

How Arguflow's business logic for handling duplicates works

1. The CardCollisions model

We create the card_collisions SQL table in this migration. A couple of subsequent migrations edit the table, but its core is created in that migration.

The CardCollisions model is then defined in this file as shown below:

#[derive(Debug, Serialize, Deserialize, Queryable, Selectable, Insertable, Clone)]
#[diesel(table_name = card_collisions)]
pub struct CardCollisions {
    pub id: uuid::Uuid,
    pub card_id: uuid::Uuid,
    pub collision_qdrant_id: Option<uuid::Uuid>,
    pub created_at: chrono::NaiveDateTime,
    pub updated_at: chrono::NaiveDateTime,
}

Each CardCollisions row relates to a CardMetadata row in the card_metadata table via the card_id column. The collision_qdrant_id column relates the CardCollisions row to a point in the Qdrant database.

2. When a card is created, we check for collisions

When a card is created, we find the nearest existing point by calling global_unfiltered_top_match_query and store the result in first_semantic_result as follows:

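let first_semantic_result = global_unfiltered_top_match_query(embedding_vector.clone())
    .map_err(|err| {
        // Excerpt abbreviated: the error type that wraps this message is not shown here.
        format!("Could not get semantic similarity for collision check: {}", err)
    })?;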

3. If the first_semantic_result is within the DUPLICATE_DISTANCE_THRESHOLD of 0.95, we create a CardCollisions row and not a Qdrant point

When the similarity score of first_semantic_result meets the DUPLICATE_DISTANCE_THRESHOLD (0.95 by default), we create only a CardCollisions row and skip creating a Qdrant point. Adding a Qdrant point for a duplicate card would worsen search-time performance and prevent MMR from working.

The implementation of this logic can be found here.
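As a rough illustration of the decision described in this step, the sketch below models it with plain standard-library types. Everything in it is hypothetical (TopMatch, InsertPlan, plan_insert, and the use of String IDs instead of UUIDs), and it assumes that collision_qdrant_id refers to the existing point the new card collided with. It is not Arguflow's implementation.

```rust
/// Hypothetical stand-in for the nearest existing Qdrant point.
#[derive(Debug, Clone, PartialEq)]
struct TopMatch {
    /// ID of the matched point (the real code would use a UUID).
    point_id: String,
    /// Cosine similarity between the new card and the matched point.
    score: f32,
}

/// What to persist for a newly created card.
#[derive(Debug, PartialEq)]
enum InsertPlan {
    /// No sufficiently similar card exists: create a new Qdrant point.
    NewQdrantPoint,
    /// A duplicate exists: create only a collision row that points at
    /// the already stored Qdrant point (assumption, see lead-in).
    CollisionOnly { collision_qdrant_id: String },
}

/// Decide between a new point and a collision row.
fn plan_insert(top_match: Option<&TopMatch>, threshold: f32) -> InsertPlan {
    match top_match {
        Some(m) if m.score >= threshold => InsertPlan::CollisionOnly {
            collision_qdrant_id: m.point_id.clone(),
        },
        // Either nothing is stored yet or the nearest point is not
        // similar enough to count as a duplicate.
        _ => InsertPlan::NewQdrantPoint,
    }
}
```

Modeling the outcome as an enum makes the "collision row only, no new point" rule explicit: a caller cannot accidentally do both for the same card.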

The result in our default search UI

If you search for "Enron’s liquidity is very solid and there is no need for any panic or alarm" on our Enron demo, you will see arrows on the sides of the resulting cards, as shown below:

Arrows on the sides of resulting cards

These arrows allow you to navigate between cards that are duplicates of each other.

Thanks for reading! If you have any questions, please reach out to us on Twitter, Matrix, or Discord.