<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[ColiVara: A State of the Art Retrieval API for AI workflows]]></title><description><![CDATA[ColiVara is a State of the Art Retrieval API - with a delightful developer experience. It stores, searches, and retrieves documents based on their visual embedding.]]></description><link>https://blog.colivara.com</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 22:26:41 GMT</lastBuildDate><atom:link href="https://blog.colivara.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Unlocking 70% Faster Response Times Through Token Pooling]]></title><description><![CDATA[TLDRThis post examines improvements made to ColiVara, our ColPali-based retrieval API. We focus on hybrid search and hierarchical clustering token pooling. By benchmarking these two approaches, we aim to evaluate their impact on latency and performan...]]></description><link>https://blog.colivara.com/unlocking-70-faster-response-times-through-token-pooling</link><guid isPermaLink="true">https://blog.colivara.com/unlocking-70-faster-response-times-through-token-pooling</guid><category><![CDATA[colpali]]></category><category><![CDATA[AI]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Mon, 02 Dec 2024 18:31:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733162775029/57ebceda-dc33-4529-b843-74ee91633d2b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details><summary>TLDR</summary><div data-type="detailsContent">This post examines improvements made to ColiVara, our ColPali-based retrieval API. We focus on hybrid search and hierarchical clustering token pooling. 
By benchmarking these two approaches, we aim to evaluate their impact on latency and performance.</div></details>

<h2 id="heading-background">Background</h2>
<p>This post examines improvements made to ColiVara, our ColPali-based retrieval API. We focus on hybrid search and hierarchical clustering token pooling. By benchmarking these two approaches, we aim to evaluate their impact on latency and performance.</p>
<p>The conventional approach to handling documents for RAG or data extraction typically involves a multi-stage process: Optical Character Recognition (OCR) to extract text, Layout Recognition to understand the document structure, Figure Captioning to interpret images, Chunking to segment the text, and finally, Embedding to represent each segment in a vector space.</p>
<p>This pipeline is not only complex and computationally demanding but also prone to error propagation. Inaccuracies in any stage, for example, OCR errors or misinterpretation of visual layouts, can significantly degrade the quality of downstream retrieval and generation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164599821/f4c2710c-ca7b-4cf7-9f67-5a5059cae488.jpeg" alt class="image--center mx-auto" /></p>
<p>A more streamlined approach, as pioneered in the <a target="_blank" href="https://arxiv.org/html/2407.01449v3">ColPali</a> paper, leverages the power of vision models. Instead of complex pre-processing, this method directly embeds entire document pages as images, simplifying the retrieval process to a similarity search on these visual embeddings.</p>
<p>This approach eliminates the need for laborious pre-processing steps and potentially captures richer contextual information from the visual layout and style of the document. This innovative approach forms the foundation of our work in ColiVara.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164640697/4d30a8c2-95aa-49af-baf3-e05d5073886d.png" alt class="image--center mx-auto" /></p>
<p>This post details our research on <strong>using hierarchical clustering token pooling,</strong> building upon and extending the core ideas presented in ColPali.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A key contribution of ColiVara lies in its API-first design. We prioritize developer experience and integration into real-world applications. This architecture, however, introduces practical considerations related to network latency and data storage.</div>
</div>

<h2 id="heading-research-question">Research Question</h2>
<p>In our previous <a target="_blank" href="https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision">post</a>, we looked into whether we could speed up our inference by using different similarity calculations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164819871/8c5ca1ce-c619-4a19-9f74-7bd7bfe4ab0b.avif" alt class="image--center mx-auto" /></p>
<p>We found that the bottleneck is directly linked to the number of embeddings for each document. The more embeddings there are, the longer the calculation takes. The cost scales as <code>O(n*m)</code>, where <code>n</code> is the number of query vectors and <code>m</code> is the number of document vectors.</p>
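<p>To make the scaling concrete, here is a small NumPy sketch (with made-up sizes) showing that the similarity matrix, and hence the work, shrinks linearly as the number of document vectors <code>m</code> goes down:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: n query vectors vs. m document vectors,
# each of dimension 128 as in ColPali-style models.
n, d = 20, 128

for m in (1038, 346):  # full page vs. pooled at factor 3
    Q = rng.standard_normal((n, d))
    D = rng.standard_normal((m, d))
    sims = Q @ D.T  # every query vector against every document vector
    print(sims.shape, sims.size)  # n*m entries: work shrinks linearly with m
```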
<p>The question we wanted to answer was: <strong>Can we maintain the same state-of-the-art performance with either fewer document candidates or a significantly lower embedding count?</strong></p>
<p>The goal was to see if we could improve latency by <strong>reducing the number of documents</strong> (using hybrid search) or the <strong>number of document vectors</strong> (using token pooling) without losing any performance.</p>
<p>Our metric, consistent with the ColPali paper, was the NDCG@5 score on <strong>ArxivQA</strong> and our typical API request latency using:</p>
<ul>
<li><p><strong>Hybrid Keyword</strong> Search with Postgres native search capabilities to reduce the number of candidates for inference</p>
</li>
<li><p><strong>Token pooling</strong> at indexing time to reduce the total number of embeddings by averaging within similarity clusters</p>
</li>
</ul>
<p>The ColPali paper uses late-interaction style computations to determine the relevancy between a query and documents. Here is a simple GitHub gist explaining how late-interaction style computations work: <a target="_blank" href="https://gist.github.com/Jonathan-Adly/cb905cc3958b973d7b3b6f25d9915c39">https://gist.github.com/Jonathan-Adly/cb905cc3958b973d7b3b6f25d9915c39</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733165277073/cfa3328b-bd03-47a6-a61d-1a102dbce351.png" alt class="image--center mx-auto" /></p>
<p>The key point is that late-interactions rely on <strong>multi-vector</strong> representation. Multi-vectors introduce significant storage and memory overhead and are computationally demanding to search through.</p>
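<p>As a quick illustration of the gist above, here is a minimal NumPy sketch of late-interaction (MaxSim) scoring, using toy 3-dimensional vectors in place of real 128-dimensional embeddings:</p>

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) relevance: for each query vector, take its
    best dot product against any document vector, then sum those maxima."""
    sims = query_vecs @ doc_vecs.T  # (n_query, n_doc) dot products
    return sims.max(axis=1).sum()   # best match per query vector, summed

# Toy vectors standing in for multi-vector embeddings
query = np.array([[1, 2, 3], [0, 1, 1]])
doc = np.array([[4, 5, 6], [7, 8, 0], [1, 1, 1]])
print(maxsim_score(query, doc))  # 32 + 11 = 43
```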
<h2 id="heading-baseline-implementation">Baseline implementation</h2>
<p>To get a realistic picture of real-life performance, we set the parameters as close to our production setup at ColiVara as possible. Embeddings were stored in Postgres with the pgvector extension. Everything ran on an AWS r6g.xlarge (4-core CPU, 32 GB RAM) and was called from our Python backend code hosted on a VPS.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Because requests travel over the network, latency is also affected by where the user is on the globe and their network conditions.</div>
</div>

<p>We use ColQwen2 as the base model and <a target="_blank" href="https://github.com/illuin-tech/colpali">colpali-engine (v0.3.4)</a> to generate embeddings. It improves upon the paper's base implementation with ~25% fewer embeddings and better performance. Our work builds on top of those improvements to enhance them further.</p>
<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://github.com/taesiri/ArXivQA"><strong>ArxivQA</strong></a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average NDCG@5 score: 0.88</p>
</li>
<li><p>Average latency: 3.58 seconds</p>
</li>
</ul>
<p>The dataset is composed of 500 pages. This score matches the leader of the <a target="_blank" href="https://huggingface.co/spaces/vidore/vidore-leaderboard">Vidore leaderboard</a> and is considered state-of-the-art retrieval.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">DCG is a measure of relevance that considers the position of relevant results in the returned list. It assigns higher scores to results that appear earlier. Normalized Discounted Cumulative Gain normalizes DCG by dividing it by the ideal DCG (IDCG) for a given query, providing a score between 0 and 1. In this project, we calculate NDCG@5 to evaluate the top 5 search results for each query.</div>
</div>
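<p>For readers who want the metric spelled out, here is a minimal sketch of NDCG@k under the simplifying assumption of binary relevance (the <code>ndcg_at_k</code> helper is illustrative, not our evaluation code):</p>

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a ranked list of relevance grades (binary 0/1 here).
    DCG discounts each result by log2 of its position; NDCG divides by the
    DCG of the ideal (best possible) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The single correct page returned at rank 1 vs. rank 3
print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```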

<h2 id="heading-hybrid-search">Hybrid search</h2>
<p>Having run a few large RAG applications before, we were well aware of the power of hybrid search. The details of the implementation can vary, but the main idea is to quickly narrow down your candidate documents using Postgres search capabilities, then re-rank them with more advanced semantic search techniques.</p>
<p>In our implementation, we used gemini-flash-8B to create captions and add keywords to each document during indexing. At query time, we used the same LLM to convert the query into keywords, then employed standard Postgres search with a GIN index to retrieve the top 25 documents.</p>
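<p>To show the shape of this two-stage pipeline, here is a toy, self-contained sketch. The names and data are hypothetical; production uses Postgres full-text search with a GIN index for stage one, not Python sets:</p>

```python
import numpy as np

# Toy corpus: each page has LLM-generated keywords plus multi-vector embeddings.
pages = {
    "p1": {"keywords": {"transformer", "attention"},
           "emb": np.array([[1.0, 0.0], [0.0, 1.0]])},
    "p2": {"keywords": {"protein", "folding"},
           "emb": np.array([[0.0, 1.0], [1.0, 1.0]])},
}

def hybrid_search(query_keywords, query_emb, top_k=1):
    # Stage 1: cheap keyword overlap narrows the candidate set.
    candidates = [p for p, d in pages.items() if d["keywords"] & query_keywords]
    # Stage 2: late-interaction (MaxSim) rerank over the survivors only.
    def maxsim(doc_emb):
        return (query_emb @ doc_emb.T).max(axis=1).sum()
    return sorted(candidates, key=lambda p: -maxsim(pages[p]["emb"]))[:top_k]

print(hybrid_search({"attention"}, np.array([[1.0, 0.0]])))  # ['p1']
```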
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">There are many tunable parameters in a hybrid search implementation. As our goal was to improve latency, we were intentional about using the fastest reasonable multimodal LLM at query time, the lowest reasonable candidate count, and out-of-the-box standard Postgres settings.</div>
</div>

<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://github.com/taesiri/ArXivQA"><strong>ArxivQA</strong></a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average latency: 2.65 seconds</p>
</li>
<li><p>Average NDCG@5 score: 0.68</p>
</li>
</ul>
<p>The good news is that we reduced latency by about 25%. However, performance decreased significantly. We believe that by tuning the parameters of the hybrid search implementation we could recover some performance, but it would involve a trade-off between latency and performance. For instance, using a larger model like Qwen2-VL 72B for keyword extraction might enhance performance, but it would also be slower. Similarly, indexing the full text of the documents instead of just captions and keywords might score better, but it would also slow things down.</p>
<h2 id="heading-hierarchical-clustering-token-pooling"><strong>Hierarchical Clustering Token</strong> Pooling</h2>
<p>The next optimization we wanted to test is <strong>hierarchical clustering token pooling</strong>. Despite the complex name, it's actually quite simple. You take your embeddings and average them in <strong>clusters of similarity</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733163945750/2784009d-3031-4f85-8419-bae8ffc8d1e4.png" alt class="image--center mx-auto" /></p>
<p>You start with a desired compression level, called the pooling factor. At a pooling factor of 1, you keep your embeddings as they are. With a factor of 2, you reduce your embeddings by half. At a pooling factor of 3, you save only a third of your original embeddings.</p>
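<p>The arithmetic is straightforward; for a 1038-vector page, the pooling factor maps to cluster counts like so:</p>

```python
# Pooling factor -> number of embeddings kept for a 1038-vector page
page_vectors = 1038
for pool_factor in (1, 2, 3):
    kept = max(page_vectors // pool_factor, 1)
    print(pool_factor, kept)  # 1 -> 1038, 2 -> 519, 3 -> 346
```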
<p>We got the idea from this excellent <a target="_blank" href="https://www.answer.ai/posts/colbert-pooling.html">post</a> by Answer.AI, which tried this approach using ColBert and found success. The optimal point seemed to be a pooling factor of 3, so we chose that.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The most common implementation of pooling is to just take everything and average it into a single vector. Like Answer.AI, we are very skeptical that this is a good approach. <strong>Not all tokens are created equal</strong>. Single-vector pooling averages important tokens together with unimportant ones into a single representation.</div>
</div>

<p>The implementation is only ~10 lines of code: a standalone function that pools your embeddings at index time, or even after the fact.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List

<span class="hljs-keyword">from</span> scipy.cluster.hierarchy <span class="hljs-keyword">import</span> fcluster, linkage
<span class="hljs-keyword">from</span> scipy.spatial.distance <span class="hljs-keyword">import</span> squareform

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">pool_embeddings</span>(<span class="hljs-params">embeddings: torch.Tensor, pool_factor: int = <span class="hljs-number">3</span></span>) -&gt; List[List[float]]:</span>
    <span class="hljs-string">"""
    Reduces number of embeddings by clustering similar ones together.

    Args:
        embeddings: Single image embeddings of shape (1038, 128)
                   Example with 4 vectors, 3 dimensions for simplicity:
                   [[1,0,1],
                    [1,0,1],
                    [0,1,0],
                    [0,1,0]]
    """</span>
    <span class="hljs-comment"># Step 1: Calculate similarity between all vectors</span>
    <span class="hljs-comment"># For our example above, this creates a 4x4 similarity matrix:</span>
    <span class="hljs-comment"># [[1.0  1.0  0.0  0.0],    # Token 1 compared to all tokens (same, same, different, different)</span>
    <span class="hljs-comment">#  [1.0  1.0  0.0  0.0],    # Token 2 compared to all tokens</span>
    <span class="hljs-comment">#  [0.0  0.0  1.0  1.0],    # Token 3 compared to all tokens</span>
    <span class="hljs-comment">#  [0.0  0.0  1.0  1.0]]    # Token 4 compared to all tokens</span>
    <span class="hljs-comment"># High values (1.0) mean tokens are very similar</span>
    similarities = torch.mm(embeddings, embeddings.t())

    <span class="hljs-comment"># Step 2: Convert to distances (1 - similarity)</span>
    <span class="hljs-comment"># For our example:</span>
    <span class="hljs-comment"># [[0.0  0.0  1.0  1.0],    # Now low values mean similar</span>
    <span class="hljs-comment">#  [0.0  0.0  1.0  1.0],    # 0.0 = identical</span>
    <span class="hljs-comment">#  [1.0  1.0  0.0  0.0],    # 1.0 = completely different</span>
    <span class="hljs-comment">#  [1.0  1.0  0.0  0.0]]</span>
    distances = <span class="hljs-number">1</span> - similarities.cpu().numpy()

    <span class="hljs-comment"># Step 3: Calculate target number of clusters</span>
    <span class="hljs-comment"># For our example with pool_factor=2:</span>
    <span class="hljs-comment"># 4 tokens → 2 clusters</span>
    target_clusters = max(embeddings.shape[<span class="hljs-number">0</span>] // pool_factor, <span class="hljs-number">1</span>)

    <span class="hljs-comment"># Step 4: Perform hierarchical clustering</span>
    <span class="hljs-comment"># This groups similar tokens together</span>
    <span class="hljs-comment"># For our example, cluster_labels would be:</span>
    <span class="hljs-comment"># [1, 1, 2, 2]  # Tokens 1&amp;2 in cluster 1, Tokens 3&amp;4 in cluster 2</span>
    <span class="hljs-comment"># linkage expects a condensed distance matrix, so flatten the square one</span>
    clusters = linkage(squareform(distances, checks=<span class="hljs-literal">False</span>), method=<span class="hljs-string">"ward"</span>)
    cluster_labels = fcluster(clusters, t=target_clusters, criterion=<span class="hljs-string">"maxclust"</span>)

    <span class="hljs-comment"># Step 5: Average embeddings within each cluster</span>
    <span class="hljs-comment"># For our example:</span>
    <span class="hljs-comment"># Cluster 1 average = [1,0,1] and [1,0,1] → [1,0,1]</span>
    <span class="hljs-comment"># Cluster 2 average = [0,1,0] and [0,1,0] → [0,1,0]</span>
    <span class="hljs-comment"># Final result: [[1,0,1], [0,1,0]]</span>
    pooled = []
    <span class="hljs-keyword">for</span> cluster_id <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, target_clusters + <span class="hljs-number">1</span>):
        mask = cluster_labels == cluster_id
        cluster_embeddings = embeddings[mask]
        cluster_mean = cluster_embeddings.mean(dim=<span class="hljs-number">0</span>)
        pooled.append(cluster_mean.tolist())

    <span class="hljs-keyword">return</span> pooled
</code></pre>
<p>This method of pooling is elegant in that nothing else changes except the number of embeddings being generated.</p>
<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://github.com/taesiri/ArXivQA"><strong>ArxivQA</strong></a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average latency: 2.13 seconds</p>
</li>
<li><p>Average NDCG@5 score: 0.87</p>
</li>
</ul>
<p>This was as close to a free lunch as you will ever get. Storage cost went down by 66%, latency improved by about 40%, and there was very little performance loss. It was the kind of optimization that is usually only theoretical.</p>
<p>We decided to implement it in production and ran the entire evaluation suite. The final results were even more impressive, with up to 70% better latency on larger document collections and very minimal loss. You can see the full <a target="_blank" href="https://github.com/tjmlabs/colivara-eval">results</a> here.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We tested and experimented with hybrid search and token pooling to improve our API's latency without sacrificing performance. Hybrid search improved latency by narrowing down results with keywords, but it reduced performance. On the other hand, hierarchical clustering token pooling greatly improved storage efficiency and latency with minimal performance loss.</p>
]]></content:encoded></item><item><title><![CDATA[From Cosine to Dot: Benchmarking Similarity Methods for Speed and Precision]]></title><description><![CDATA[Background
Retrieval Augmented Generation (RAG) empowers large language models (LLMs) by integrating private documents and proprietary knowledge, unlocking their potential for nuanced and informed responses. However, efficiently extracting informatio...]]></description><link>https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision</link><guid isPermaLink="true">https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision</guid><category><![CDATA[AI]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[search]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Wed, 20 Nov 2024 15:25:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732044793701/e7329077-d762-4c80-a839-5f16aa932f76.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>Retrieval Augmented Generation (RAG) empowers large language models (LLMs) by integrating private documents and proprietary knowledge, unlocking their potential for nuanced and informed responses. However, efficiently extracting information from unstructured documents—especially those with complex visual layouts—remains a significant challenge.</p>
<p>The conventional approach to handling visually dense documents typically involves a multi-stage process: Optical Character Recognition (OCR) to extract text, Layout Recognition to understand the document structure, Figure Captioning to interpret images, Chunking to segment the text, and finally, Embedding to represent each segment in a vector space. This pipeline is not only complex and computationally demanding but also prone to error propagation. Inaccuracies in any stage, for example, OCR errors or misinterpretation of visual layouts, can significantly degrade the quality of downstream retrieval and generation.</p>
<p>A more streamlined approach, as pioneered in the <a target="_blank" href="https://arxiv.org/html/2407.01449v3">ColPali</a> paper, leverages the power of vision models. Instead of complex pre-processing, this method directly embeds entire document pages as images, simplifying the retrieval process to a similarity search on these visual embeddings. This approach eliminates the need for laborious pre-processing steps and potentially captures richer contextual information from the visual layout and style of the document. This innovative approach forms the foundation of our work in ColiVara.</p>
<p>This post details our research on <strong>optimizing and benchmarking various similarity calculation methods,</strong> building upon and extending the core ideas presented in ColPali.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A key contribution of ColiVara lies in its API-first design. We prioritize developer experience and integration into real-world applications. This architecture, however, introduces practical considerations related to network latency and data storage.</div>
</div>

<p>We store vector embeddings in a PostgreSQL database, employing a one-to-many relationship between vectors and their corresponding pages.</p>
<h2 id="heading-research-question">Research Question</h2>
<p>The ColPali paper uses late-interaction-style computations to calculate the relevancy between a query and documents. The question we wanted to answer: <strong>could we get the same results using other similarity computations with better latency?</strong></p>
<p>Our metric, to stay consistent with the paper, was the NDCG@5 score on DocVQA and <strong>our</strong> typical API request latency using:</p>
<ul>
<li><p>Late-interaction Cosine Similarity</p>
</li>
<li><p>Hamming distance with Binary Quantization of vectors</p>
</li>
<li><p>Hamming distance as above with late-interaction re-ranking</p>
</li>
</ul>
<p>The <strong>core idea in simple terms</strong> in the ColPali paper implementation of late-interaction is:</p>
<ul>
<li><p>We have <strong>query embeddings</strong>: an array with <code>n</code> vectors, each of size 128 (floats).</p>
</li>
<li><p>We have <strong>document embeddings</strong>: an array with 1038 vectors, each of size 128 (floats).</p>
</li>
<li><p>For <strong>each query vector</strong> (n), we find the <strong>most similar document vector</strong> by computing the <strong>dot product</strong>. Here is simple code for straight-forward dot product similarity.</p>
<pre><code class="lang-python">  <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

  <span class="hljs-comment"># Define the query vector</span>
  query = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

  <span class="hljs-comment"># Define the document vectors</span>
  document_a = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])
  document_b = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">0</span>])

  <span class="hljs-comment"># Compute the dot product between the query and each document</span>
  dot_product_a = np.dot(query, document_a)  <span class="hljs-comment"># 1*4 + 2*5 + 3*6 = 32</span>
  dot_product_b = np.dot(query, document_b)  <span class="hljs-comment"># 1*7 + 2*8 + 3*0 = 23</span>

  <span class="hljs-comment"># Output the results</span>
  print(<span class="hljs-string">f"Dot product between query and document a: <span class="hljs-subst">{dot_product_a}</span>"</span>)
  print(<span class="hljs-string">f"Dot product between query and document b: <span class="hljs-subst">{dot_product_b}</span>"</span>)
  <span class="hljs-comment"># document a is more similar than document b to our query</span>
</code></pre>
</li>
<li><p><strong>Late interaction</strong> works like this: take each <strong>query vector</strong>, find which <strong>document vector</strong> it matches best with (using dot product), and add up all these best matches. Here's a simple example that shows exactly how this works:</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># One query generates multiple vectors (each vector is 128 floats)</span>
query_vector_1 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
query_vector_2 = np.array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
query = [query_vector_1, query_vector_2]  <span class="hljs-comment"># one query -&gt; n vectors</span>

<span class="hljs-comment"># One document generates multiple vectors (each vector is 128 floats)</span>
document_vector_1 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
document_vector_2 = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">0</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
document_vector_3 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
document = [document_vector_1, document_vector_2, document_vector_3]  <span class="hljs-comment"># one document -&gt; 1038 vectors</span>

<span class="hljs-comment"># For each vector in our query, find its maximum dot product with ANY vector in the document</span>
<span class="hljs-comment"># Then sum these maximums</span>

<span class="hljs-comment"># First query vector against ALL document vectors</span>
dot_products_vector1 = [
    np.dot(query_vector_1, document_vector_1),  <span class="hljs-comment"># 1*4 + 2*5 + 3*6 = 32</span>
    np.dot(query_vector_1, document_vector_2),  <span class="hljs-comment"># 1*7 + 2*8 + 3*0 = 23</span>
    np.dot(query_vector_1, document_vector_3)   <span class="hljs-comment"># 1*1 + 2*1 + 3*1 = 6</span>
]
max_similarity_vector1 = max(dot_products_vector1)  <span class="hljs-comment"># 32</span>

<span class="hljs-comment"># Second query vector against ALL document vectors</span>
dot_products_vector2 = [
    np.dot(query_vector_2, document_vector_1),  <span class="hljs-comment"># 0*4 + 1*5 + 1*6 = 11</span>
    np.dot(query_vector_2, document_vector_2),  <span class="hljs-comment"># 0*7 + 1*8 + 1*0 = 8</span>
    np.dot(query_vector_2, document_vector_3)   <span class="hljs-comment"># 0*1 + 1*1 + 1*1 = 2</span>
]
max_similarity_vector2 = max(dot_products_vector2)  <span class="hljs-comment"># 11</span>

<span class="hljs-comment"># Final similarity score is the sum of maximum similarities</span>
final_score = max_similarity_vector1 + max_similarity_vector2  <span class="hljs-comment"># 32 + 11 = 43</span>

print(<span class="hljs-string">f"Final similarity score: <span class="hljs-subst">{final_score}</span>"</span>)
</code></pre>
<p>Now - as you can imagine, this is computationally heavy at scale (or so we thought!). So, we looked for ways to see if we can make it faster and maybe even better.</p>
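<p>A rough, machine-dependent sketch of that cost, scoring one query against a 500-page collection using random stand-in embeddings:</p>

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# One query expands to 20 vectors of dimension 128 (sizes are illustrative);
# each page contributes 1038 vectors, so one search over 500 pages performs
# roughly 20 * 1038 * 500 = ~10 million 128-dim dot products.
query = rng.standard_normal((20, 128)).astype(np.float32)

def score_page(page):
    return (query @ page.T).max(axis=1).sum()  # MaxSim for one page

start = time.perf_counter()
scores = [score_page(rng.standard_normal((1038, 128)).astype(np.float32))
          for _ in range(500)]
elapsed = time.perf_counter() - start

print(len(scores), f"{elapsed:.2f}s")
```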
<h2 id="heading-baseline-implementation">Baseline implementation</h2>
<p>To get a realistic picture of real-life performance, we set the parameters as close to our production setup at ColiVara as possible. Embeddings were stored in Postgres with the pgvector extension. Everything ran on an AWS r6g.xlarge (4-core CPU, 32 GB RAM) and was called from our Python backend code hosted on a VPS.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Because requests travel over the network, latency is also affected by where the user is on the globe and their network conditions.</div>
</div>

<p>We implemented the paper's late-interaction calculation as a Postgres function:</p>
<pre><code class="lang-sql"> <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">replace</span> <span class="hljs-keyword">FUNCTION</span> max_sim(<span class="hljs-keyword">document</span> halfvec[],
                                   <span class="hljs-keyword">query</span> halfvec[]) <span class="hljs-keyword">returns</span> <span class="hljs-keyword">DOUBLE</span> <span class="hljs-keyword">PRECISION</span>
<span class="hljs-keyword">AS</span>
  $$ <span class="hljs-keyword">WITH</span> queries <span class="hljs-keyword">AS</span>
  (
         <span class="hljs-keyword">SELECT</span> row_number() <span class="hljs-keyword">OVER</span> () <span class="hljs-keyword">AS</span> query_number,
                *
         <span class="hljs-keyword">FROM</span>   (
                       <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">unnest</span>(<span class="hljs-keyword">query</span>) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">query</span>) ), documents <span class="hljs-keyword">AS</span>
  (
         <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">unnest</span>(<span class="hljs-keyword">document</span>) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">document</span> ), similarities <span class="hljs-keyword">AS</span>
  (
             <span class="hljs-keyword">SELECT</span>     query_number,
                        (<span class="hljs-keyword">document</span> &lt;<span class="hljs-comment">#&gt; query) * -1 AS similarity</span>
             <span class="hljs-keyword">FROM</span>       queries
             <span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> documents ), max_similarities <span class="hljs-keyword">AS</span>
  (
           <span class="hljs-keyword">SELECT</span>   <span class="hljs-keyword">max</span>(similarity) <span class="hljs-keyword">AS</span> max_similarity
           <span class="hljs-keyword">FROM</span>     similarities
           <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> query_number )<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">Sum</span>(max_similarity)
  <span class="hljs-keyword">FROM</span>   max_similarities;
</code></pre>
<p>The magic line here is <code>(document &lt;#&gt; query) * -1</code>. The <code>&lt;#&gt;</code> operator comes from <a target="_blank" href="https://github.com/pgvector/pgvector">pgVector</a>, and we multiply by -1 because it returns the negative inner product by default. We also use <a target="_blank" href="https://blog.colivara.com/optimizing-vector-storage-with-halfvecs">halfvecs to store embeddings</a> for efficiency.</p>
<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://www.docvqa.org/">DocVQA</a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average NDCG@5 score: 0.55</p>
</li>
<li><p>Average latency: 3.1 seconds</p>
</li>
</ul>
<p>The dataset is composed of 500 pages.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">DCG is a measure of relevance that considers the position of relevant results in the returned list. It assigns higher scores to results that appear earlier. Normalized Discounted Cumulative Gain normalizes DCG by dividing it by the ideal DCG (IDCG) for a given query, providing a score between 0 and 1. In this project, we calculate NDCG@5 to evaluate the top 5 search results for each query.</div>
</div>

<h2 id="heading-late-interaction-cosine-similarity">Late-Interaction Cosine Similarity</h2>
<p>Having run a few large RAG applications before, we were intimately familiar with how <strong>cosine similarity</strong> works. Cosine similarity normalizes for vector length, which gives a few nice properties:</p>
<ul>
<li><p>Values are always between -1 and 1</p>
</li>
<li><p>This means vectors pointing in the same direction but with different magnitudes will be considered similar</p>
</li>
<li><p>It is computationally more expensive than the dot product, as it needs to calculate vector magnitudes (requiring square roots), but it can also perform better</p>
</li>
</ul>
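<p>The difference is easy to see in a few lines (a sketch with NumPy and toy vectors, not real embeddings):</p>

```python
import numpy as np

def dot_similarity(u, v):
    # Sensitive to magnitude.
    return float(np.dot(u, v))

def cosine_similarity(u, v):
    # Normalizes by magnitude; only direction matters.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, 10x the magnitude

print(dot_similarity(a, b))              # 140.0 (grows with magnitude)
print(round(cosine_similarity(a, b), 6))  # 1.0 (identical direction)
```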
<p>Our first attempt was meant to establish what would happen if we kept everything the same, but substituted dot product with cosine similarity.</p>
<p>So, we modified our function by replacing the <strong>dot product</strong> with <strong>cosine similarity</strong>.</p>
<p>Before:</p>
<pre><code class="lang-sql">(document &lt;#&gt; query) * -1 AS similarity
</code></pre>
<p>After:</p>
<pre><code class="lang-sql">1 - (document &lt;=&gt; query) AS similarity
</code></pre>
<p>Then - we ran our evaluations again.</p>
<p><strong>Results:</strong></p>
<ul>
<li><p>Average NDCG@5 score: 0.55</p>
</li>
<li><p>Average latency: 3.25 seconds</p>
</li>
</ul>
<p>So - our latency went up, but our scores didn’t really improve. Still, it was good to see that the main latency driver is not the math behind the calculation. We could switch to cosine similarity and would likely be fine; either way, the differences in both latency and score are marginal.</p>
<h2 id="heading-binary-quantization-and-hamming-distance">Binary Quantization and Hamming Distance</h2>
<p>Binary quantization is a compression technique for vector databases. It converts 32-bit floating point numbers into binary values (1 bit), achieving significant memory savings. The conversion process typically preserves the relative relationships between vectors while sacrificing some precision. For example:</p>
<pre><code class="lang-python">embeddings = [<span class="hljs-number">0.123</span>, <span class="hljs-number">-0.456</span>, <span class="hljs-number">0.789</span>, <span class="hljs-number">-0.012</span>]
quantized_embeddings = [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]
</code></pre>
<p>To measure similarity when we have bits as our embeddings, we use the <a target="_blank" href="https://en.wikipedia.org/wiki/Hamming_distance">hamming distance</a>. Hamming distance is a measure of the number of differing bits between two binary strings of equal length. It's calculated by comparing the corresponding bits of the two strings and counting the positions at which the bits differ.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">For example, the Hamming distance between '1010' and '1100' is 2 because there are two positions where the bits differ.</div>
</div>
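<p>Putting the two together, a minimal sketch of binary quantization plus Hamming distance in plain Python, mirroring the examples above:</p>

```python
def binary_quantize(embedding):
    # Positive values become 1; zero or negative values become 0.
    return [1 if x > 0 else 0 for x in embedding]

def hamming_distance(a, b):
    # Count the bit positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

print(binary_quantize([0.123, -0.456, 0.789, -0.012]))  # [1, 0, 1, 0]
print(hamming_distance([1, 0, 1, 0], [1, 1, 0, 0]))     # 2
```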

<p>First - we converted all our embeddings to binary and stored them in their own column. Then - we changed our similarity function to what the <a target="_blank" href="https://github.com/pgvector/pgvector-python/blob/master/examples/colpali/exact.py">pgVector-python</a> examples recommend.</p>
<pre><code class="lang-sql">1 - ((document &lt;~&gt; query) / bit_length(query))
</code></pre>
<p>Finally - we ran our evaluations.</p>
<p><strong>Results:</strong></p>
<ul>
<li><p>Average NDCG@5 score: 0.54</p>
</li>
<li><p>Average latency: 3.25 seconds</p>
</li>
</ul>
<p>The good news is that the precision loss is minimal. We are hesitant to embrace it fully even with these results, as the loss can be unpredictable. This is essentially a measure of bit diversity, which partly depends on the data being embedded—so your mileage may vary.</p>
<p>The interesting tidbit here is that latency did not improve: it is as slow as cosine similarity. We can almost conclude that if we want to make the process faster, the choice of similarity calculation is probably irrelevant.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The main benefit of Binary Quantization is storage cost. It significantly cuts down how much space your embeddings take. At larger datasets, the precision loss is probably worth the savings.</div>
</div>

<h2 id="heading-binary-quantization-with-re-ranking-w-100-documents">Binary Quantization with Re-ranking w/ 100 documents</h2>
<p>The final evaluation we wanted to conduct before wrapping up this experiment was to see if we could use Hamming distance and then re-rank using dot product late interactions. If our theory is correct—that the type of similarity calculation doesn't matter much—we should expect a slight increase in latency (since we're doing two calculations) and no significant improvement in performance.</p>
<p>That's exactly what happened.</p>
<p><strong>Results:</strong></p>
<ul>
<li><p>Average NDCG@5 score: 0.55</p>
</li>
<li><p>Average latency: 3.5 seconds</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732044784431/81983c70-c737-4a0a-abc3-a2697e1b61f0.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-bottleneck">The bottleneck</h2>
<p>Postgres is highly optimized, so adding a square root or doing a bit of extra multiplication has minimal impact. It's an <strong>obvious realization</strong> in hindsight, but without conducting these evaluations and this experiment, we wouldn't have been able to prove it. The real bottleneck is here:</p>
<pre><code class="lang-sql">FROM queries CROSS JOIN documents
</code></pre>
<p>The CROSS JOIN is <strong>the major bottleneck</strong>! Let's break down the scale:</p>
<p>For each comparison:</p>
<ul>
<li><p>Query: n vectors</p>
</li>
<li><p>Document: 1038 vectors per page</p>
</li>
<li><p>Corpus: 500 pages</p>
</li>
</ul>
<p>So for <strong>ONE</strong> query against <strong>ONE</strong> page we're doing: <code>n * 1038 vector comparisons (CROSS JOIN)</code></p>
<p>And for the full 500 pages corpus: <code>n * 1038 * 500 vector comparisons</code></p>
<p>This creates a massive cartesian product. For example, if n=10 (for query):</p>
<ul>
<li><p>10,380 comparisons per page</p>
</li>
<li><p>5,190,000 comparisons for 500 pages</p>
</li>
</ul>
<p>The CROSS JOIN creates this multiplicative <code>O(n*m)</code> complexity, where:</p>
<ul>
<li><p>n = number of query vectors</p>
</li>
<li><p>m = number of document vectors</p>
</li>
</ul>
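<p>The arithmetic is worth writing out (a back-of-the-envelope sketch using the numbers above):</p>

```python
n = 10        # query vectors
m = 1038      # document vectors per page
pages = 500   # corpus size

per_page = n * m          # comparisons for one query against one page
total = per_page * pages  # comparisons against the whole corpus

print(per_page)  # 10380
print(total)     # 5190000
```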
<p>So - the math behind the comparisons is <strong>relatively</strong> inconsequential. The real optimization has to happen before we reach this point.</p>
<h2 id="heading-what-can-be-done">What can be done?</h2>
<p>The crucial optimization insight we got out of this experiment is to hyper-focus on the number of pages/documents we run through our similarity function. Nothing else really matters.</p>
<p>If, instead of <code>CROSS JOIN documents (1038 vectors * 500 pages)</code>, we first filter:</p>
<pre><code class="lang-sql">CROSS JOIN (
    <span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> documents 
    <span class="hljs-keyword">WHERE</span> page_id <span class="hljs-keyword">IN</span> (
        <span class="hljs-comment">-- something to pre-filter to maybe 10-20 relevant pages</span>
        <span class="hljs-comment">-- instead of all 500</span>
    )
)
</code></pre>
<p>All of a sudden, we gain massive performance. The insight here, for this specific problem: <code>1038 * 10</code> <em>pages is <strong>so much better</strong> than</em> <code>1038 * 500</code> <em>pages.</em></p>
<p>It's like finding a book in a library:</p>
<ul>
<li><p>First narrow to the right section (cheap and quick search)</p>
</li>
<li><p>Then look for the exact content (expensive vector similarity)</p>
</li>
</ul>
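<p>The same two-stage idea can be sketched in code. This is a toy example with made-up page sizes and a trivial keyword pre-filter; a real pre-filter would use metadata or full-text search:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 50 pages, each with 20 token vectors of dimension 8
# (real ColPali pages have ~1038 vectors of dimension 128).
pages = {pid: rng.normal(size=(20, 8)) for pid in range(50)}
page_text = {pid: f"page about topic {pid % 5}" for pid in range(50)}
query = rng.normal(size=(4, 8))

def maxsim(query_vecs, doc_vecs):
    # Late interaction: best match per query vector, summed.
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

# Stage 1 (cheap): narrow the corpus with a trivial text match.
candidates = [pid for pid in pages if "topic 3" in page_text[pid]]

# Stage 2 (expensive): run MaxSim only on the survivors.
ranked = sorted(candidates, key=lambda p: maxsim(query, pages[p]), reverse=True)
print(len(candidates), "pages scored instead of", len(pages))  # 10 instead of 50
```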
<p>We currently use and recommend <a target="_blank" href="https://docs.colivara.com/guide/filtering">advanced filters</a> in ColiVara to reduce the document set. When a user uploads a document, they can attach arbitrary JSON metadata, as complex or simple as needed. At query time, they can use filtering to reduce the corpus size. This is a powerful technique, but it can be complex to set up.</p>
<h2 id="heading-future-work">Future work</h2>
<p>There are a couple of approaches we are planning to tackle to reduce the number of documents without any effort on the user's side. The first is a set of experiments using Postgres full-text search. This takes us away from vision models, but could be a straightforward win.</p>
<p>The second is to use vision-LLM-generated metadata to store semantic information about documents automatically at indexing time. We can then use the existing filtering to reduce the corpus size via that metadata.</p>
<p>The main problem vision models solve, in our experience, is that traditional OCR pipelines can’t capture the <strong>full</strong> visual cues in documents. Here, we don’t need to capture everything, just enough to narrow the search; the vision models will do the rest.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We benchmarked and experimented with several similarity calculation setups for our RAG API using vision models. The differences are marginal and don’t meaningfully affect performance or latency.</p>
<p>Due to the large size of embeddings produced, the best performance optimization is to reduce the number of documents that go through the similarity calculation, rather than optimize that similarity calculation itself.</p>
]]></content:encoded></item><item><title><![CDATA[Optimizing Vector Storage with halfvecs]]></title><description><![CDATA[RAG (Retrieval Augmented Generation) is a powerful technique that allows us to enhance large language models (LLMs) output with private documents and proprietary knowledge that is not available elsewhere. For example, a company's internal documents o...]]></description><link>https://blog.colivara.com/optimizing-vector-storage-with-halfvecs</link><guid isPermaLink="true">https://blog.colivara.com/optimizing-vector-storage-with-halfvecs</guid><category><![CDATA[AI]]></category><category><![CDATA[pgvector]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Thu, 14 Nov 2024 15:47:32 GMT</pubDate><content:encoded><![CDATA[<p>RAG (Retrieval Augmented Generation) is a powerful technique that allows us to enhance large language models (LLMs) output with private documents and proprietary knowledge that is not available elsewhere. For example, a company's internal documents or a researcher's notes.</p>
<p>There are many ways to give relevant context to LLMs in a RAG system. We can use a simple keyword search in the database or more advanced search algorithms like BM25, which go beyond keyword matching. Here is an example of a simple keyword search.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> articles
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">content</span> <span class="hljs-keyword">LIKE</span> <span class="hljs-string">'%keyword%'</span>;
</code></pre>
<p>A step further, we can use pretrained language models to create embeddings that provide a lot of information through high dimensionality. Here is a simple example using the excellent SentenceTransformers library.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer

<span class="hljs-comment"># 1. Load model</span>
model = SentenceTransformer(<span class="hljs-string">"all-MiniLM-L6-v2"</span>)

<span class="hljs-comment"># 2. Our documents and query</span>
documents = [
    <span class="hljs-string">"Python is great for programming"</span>,
    <span class="hljs-string">"I have a pet dog"</span>,
    <span class="hljs-string">"The weather is sunny today"</span>
]
query = <span class="hljs-string">"How to program in Python?"</span>

<span class="hljs-comment"># 3. Calculate embeddings</span>
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

<span class="hljs-comment"># 4. Find similarities</span>
similarities = model.similarity(query_embedding, doc_embeddings)
print(<span class="hljs-string">f"Most similar document: <span class="hljs-subst">{documents[similarities.argmax()]}</span>"</span>)
<span class="hljs-comment"># Python is great for programming</span>
</code></pre>
<p>In recent years, pretrained language models have greatly improved text embedding models. However, in our experience, the main challenge for efficient document retrieval is not the performance of the embedding model but the earlier data ingestion process.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The goal of RAG is to provide LLMs with relevant context to enhance their performance. This is the essence of RAG, and it can be tailored to be as simple or complex as the specific use-case demands.</div>
</div>

<p>The process of turning documents into passages of text via OCR, chunking, and a complex pipeline of data cleaning is fragile and error-prone. For one of our projects doing RAG over clinical trials, we lost over 30% of the context during this process.</p>
<h2 id="heading-colpali">ColPali</h2>
<p>One advanced technique to improve this process is a retrieval model architecture called <a target="_blank" href="https://arxiv.org/abs/2407.01449">ColPali</a>. It uses the document understanding abilities of recent <strong>Vision</strong> Language Models to create embeddings directly from images of document pages. ColPali significantly outperforms modern document retrieval pipelines while being much faster.</p>
<p>One of the trade-offs of this new retrieval method is that while "late interaction" allows for more detailed matching between specific parts of the query and the potential context, it requires more computing power than simple vector comparisons and produces up to 100 times more embeddings per page.</p>
<p>These trade-offs are often worthwhile in highly visual documents and situations where accuracy is crucial.</p>
<p>Here, we highlight one of our many optimizations in ColiVara, where we leveraged halfvecs as our preferred method of <strong>scalar quantization.</strong></p>
<h2 id="heading-colivara">ColiVara</h2>
<p><a target="_blank" href="https://colivara.com">ColiVara</a> is a state of the art retrieval API that stores, searches, and retrieves documents based on their visual embedding. End to end it uses vision models instead of chunking and text-processing for documents.</p>
<p>In simple terms, we ask the AI models to "see" and reason, rather than "read" and reason. From the user's perspective, it functions like retrieval augmented generation (RAG) but uses vision models instead of chunking and text-processing for documents.</p>
<p>It is a web-first implementation of the <a target="_blank" href="https://arxiv.org/abs/2407.01449">ColPali: Efficient Document Retrieval with Vision Language Models</a> paper.</p>
<p>Like many AI/ML RAG systems, we create and store vectors when we save a user’s document. Since we use ColPali under the hood, each page generates embeddings that look like this.</p>
<pre><code class="lang-python"><span class="hljs-comment"># List of 1030 members, each a list of 128 floats per page</span>
embeddings = [[<span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, ..., <span class="hljs-number">0.128</span>], [<span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, ...]]
</code></pre>
<p>Let's calculate the storage requirements for this:</p>
<ol>
<li><p>Each float is 4 bytes.</p>
</li>
<li><p>Each embedding has 128 dimensions, so: 128 * 4 bytes = 512 bytes per embedding.</p>
</li>
<li><p>Total embeddings: 1030.</p>
</li>
<li><p>Total storage: 1030 * 512 bytes = 527,360 bytes ≈ 515 KB per page.</p>
</li>
</ol>
<p>If we have a 100-page document and a collection of 100 documents, then:</p>
<ol>
<li><p>515 KB * 100 pages = 51.5 MB per document.</p>
</li>
<li><p>51.5 MB * 100 documents = 5.15 GB per collection.</p>
</li>
</ol>
<p>This calculation is just for the raw numerical data. Actual memory usage in Python might be slightly higher due to Python's object overhead and list structure. ~5 GB per collection is manageable, but not exactly lightweight. So, we explored different quantization methods to better manage our resource usage.</p>
<h2 id="heading-quantization">Quantization</h2>
<p>There are three common quantization techniques around vector databases:</p>
<ul>
<li><p><strong>Scalar quantization</strong>, which reduces the overall size of the dimensions to a smaller data type (e.g. a 4-byte float to a 2-byte float or 1-byte integer).</p>
</li>
<li><p><strong>Binary quantization</strong>, which is a subset of scalar quantization that reduces the dimensions to a single bit (e.g. <code>&gt; 0</code> to <code>1</code>, <code>&lt;=0</code> to <code>0</code>).</p>
</li>
<li><p><strong>Product quantization</strong>, which uses a clustering technique to effectively remap the original vector to a vector with smaller dimensionality and indexes that (e.g. reduce a vector from 128-dim to 8-dim).</p>
</li>
</ul>
<p>Scalar quantization is often the easiest way to reduce vector index storage. It involves converting dimensions to a smaller data type, like changing a 4-byte float to a 2-byte float.</p>
<p>In many cases, using a 2-byte float makes sense because, during distance operations, the most important differences between two dimensions are in the more significant bits. By slightly reducing the information to focus on those bits, we shouldn't notice much difference in recall.</p>
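<p>The storage arithmetic from earlier makes the payoff concrete. A quick sketch (binary-prefix rounding differs slightly from the decimal figures above):</p>

```python
DIMS = 128                 # dimensions per embedding
EMBEDDINGS_PER_PAGE = 1030 # ColPali embeddings per page
PAGES = 100 * 100          # 100 documents x 100 pages

def storage_bytes(bytes_per_dim):
    return PAGES * EMBEDDINGS_PER_PAGE * DIMS * bytes_per_dim

full = storage_bytes(4)  # 32-bit floats
half = storage_bytes(2)  # halfvec: 16-bit floats

print(full)         # 5273600000 bytes (~5 GB)
print(half / full)  # 0.5 -> storage cut in half
```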
<p>In addition, ColPali's original implementation used Bfloat16, so the extra bits we would store by converting to 4-byte floats are imprecise anyway.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">It's worth noting that Bfloat16 is not the same as Float16 (IEEE half-precision), even though both are 16-bit formats.</div>
</div>

<p>You very rarely get a free lunch with quantization, but here we are: in this particular instance, it looks like we really do.</p>
<h2 id="heading-pgvector-performance">pgVector performance</h2>
<p>Jonathan Katz, the pgVector maintainer, has benchmarked and evaluated halfvecs in an excellent <a target="_blank" href="https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/">post</a>, which we highly recommend. In summary, you get near-identical performance between halfvecs and full vectors, but you cut your storage in half and gain slight speedups.</p>
<p>This was proof enough for us on the savings. But, Late Interactions embeddings are really a different beast than normal embeddings. So, we needed to validate performance.</p>
<p>We ran the ArxivQ portion of the Vidore benchmark, and our score was 86.6, matching state-of-the-art results on the Vidore leaderboard at the time we ran it. This made us comfortable that there is no significant performance cost to using halfvecs, so we proceeded.</p>
<h2 id="heading-future-work">Future work</h2>
<p>Optimizing vector storage with halfvecs is a first step toward making the ColPali architecture viable and cost-effective. We plan to explore a few more optimizations in the future, specifically around latency and the use of re-rankers.</p>
<p>The ColPali architecture uses MaxSim to calculate relevancy. At larger corpus sizes, the MaxSim calculation becomes a significant overhead with less-than-ideal latency.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">MaxSim (Maximum Similarity) is a method for measuring relevance between a query and a document by finding the maximum similarity score between query terms and document terms.</div>
</div>

<p>Most “traditional” RAG architectures use cosine similarity as a first-step relevance calculation. So, in a sense, this is our baseline. MaxSim is more computationally intense than cosine similarity because it compares each query term with every document term.</p>
<p>While cosine similarity does just one vector comparison, MaxSim does many:</p>
<ul>
<li>If there are <code>n</code> terms in the query and <code>m</code> in the document, MaxSim needs <code>n × m</code> cosine-similarity-like calculations, making it much slower.</li>
</ul>
<p>So, MaxSim could be <strong>100 to 5,000 times more costly</strong> than cosine similarity, depending on the number of terms.</p>
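<p>A sketch of the difference in work (NumPy, toy sizes, with MaxSim written as a single matrix multiply):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, dim = 20, 1030, 128        # query terms, document terms, dimensions
Q = rng.normal(size=(n, dim))    # multi-vector query (late interaction)
q = rng.normal(size=(dim,))      # single-vector query (traditional RAG)
D = rng.normal(size=(m, dim))    # document token embeddings
d = D.mean(axis=0)               # single-vector document stand-in

# Cosine similarity: one vector-vector comparison per document.
cos = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# MaxSim: n x m comparisons -- every query term vs every document term.
maxsim_score = float((Q @ D.T).max(axis=1).sum())

print(1, "comparison vs", n * m, "comparisons")  # 1 vs 20600
```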
<p>We believe the way to solve this is via re-rankers. In practice, we would run a fast search to narrow down the number of documents, then run MaxSim on those. Instead of running MaxSim on 1,000 documents, we would run it on only 10.</p>
<p>Our next step is an automated evaluation pipeline, so we can accurately identify and optimize this process. We believe a combination of native Postgres vector search followed by MaxSim is probably the best balance, but we want a good foundation of automated evaluations first.</p>
<h2 id="heading-binary-quantization">Binary <strong>Quantization</strong></h2>
<p>Binary quantization is a more extreme technique that reduces the full value of a vector's dimension to just a single bit of information. Specifically, it converts any positive value to <code>1</code> and any zero or negative value to <code>0</code>.</p>
<p>For further storage optimizations, we ran a few quick experiments with Binary Quantization and concluded that the performance penalty is hard to predict, since bit diversity is not easily measured.</p>
<p>Bit diversity depends on the embedding model, its size, and the data being embedded. Our eval data and our customers' data could look very different, so it is difficult to measure the effects.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">One common technique with Binary Quantization is to use Hamming distance scores to measure similarity. Hamming scores count the number of bit positions that differ between two binary strings, providing a simple similarity metric for binary data where a score of 0 indicates identical strings.</div>
</div>

<p>We could explore future pipelines where we compute Hamming distance scores first, then MaxSim. However, this would increase storage requirements, since we would need to store both halfvecs and binary bits, and it could be less predictable than standard Postgres vector search.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We recommend using halfvecs as the starting point for efficient vector storage. The performance loss is minimal, and the storage savings are substantial. In ColiVara, which is built on top of pgVector and Postgres, we experienced no performance loss and achieved a 50% reduction in storage usage.</p>
]]></content:encoded></item><item><title><![CDATA[ColiVara: a state of the art retrieval API with a delightful developer experience.]]></title><description><![CDATA[We are launching ColiVara today!
ColiVara is a state of the art retrieval API that stores, searches, and retrieves documents based on their visual embedding. End to end it uses vision models instead of chunking and text-processing for documents.
In s...]]></description><link>https://blog.colivara.com/colivara-a-state-of-the-art-retrieval-api-with-a-delightful-developer-experience</link><guid isPermaLink="true">https://blog.colivara.com/colivara-a-state-of-the-art-retrieval-api-with-a-delightful-developer-experience</guid><category><![CDATA[RAG ]]></category><category><![CDATA[AI]]></category><category><![CDATA[#ai-tools]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Mon, 11 Nov 2024 16:47:16 GMT</pubDate><content:encoded><![CDATA[<p>We are launching ColiVara today!</p>
<p>ColiVara is a state of the art retrieval API that stores, searches, and retrieves documents based on their visual embedding. End to end it uses vision models instead of chunking and text-processing for documents.</p>
<p>In simple terms, we ask the AI models to "see" and reason, rather than "read" and reason. From the user's perspective, it functions like retrieval augmented generation (RAG) but uses vision models instead of chunking and text-processing for documents.</p>
<p>You can read more about the project on <a target="_blank" href="https://github.com/tjmlabs/ColiVara">Github</a> or browse the <a target="_blank" href="http://docs.colivara.com">documentation</a>. You can also try it for free at <a target="_blank" href="https://colivara.com">colivara.com</a>.</p>
<h2 id="heading-why">Why?</h2>
<p>RAG (Retrieval Augmented Generation) is a powerful technique that allows us to enhance large language models (LLMs) output with private documents and proprietary knowledge that is not available elsewhere. For example, a company's internal documents or a researcher's notes.</p>
<p>However, it is limited by the quality of the text extraction pipeline. With limited ability to extract visual cues and other non-textual information, RAG can be suboptimal for documents that are visually rich. ColiVara uses vision models to generate embeddings for documents, allowing you to retrieve documents based on their visual content.</p>
<p>In addition to RAG, ColiVara works as a visual data extraction pipeline. Most systems today don't have APIs, so for LLMs to interact with them, they have to process everything visually. ColiVara allows you to treat anything as an image and get whatever data you need from there, the same way a human would. What you see is what you get.</p>
<h2 id="heading-tech">Tech</h2>
<p>ColiVara is a web-first implementation of the <a target="_blank" href="https://arxiv.org/abs/2407.01449"><code>ColPali: Efficient Document Retrieval with Vision Language Models</code></a> paper, featuring several optimizations. First, we re-implemented the scoring from the paper using Postgres and Pgvector, wrapped in Django ORM. This allows any developer to use the API and contribute. There's no need for PyTorch, CUDA, or code that only works in notebooks but not on servers.</p>
<p>For the embeddings, we dockerized and optimized the pipeline for serverless workloads.</p>
<p>Finally, we built the entire pipeline as an API with a great developer experience to support production workloads.</p>
<h2 id="heading-evals">Evals</h2>
<p>We will write a dedicated article about our evaluation process, including reproducibility and continuous improvements. The latest benchmark score we hit is an ArxivQ score of 86.6, matching state-of-the-art results on the <a target="_blank" href="https://huggingface.co/spaces/vidore/vidore-leaderboard">Vidore leaderboard</a> and higher than the original ColPali paper, which scored <strong>79.1</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731342624260/edf00a0d-d417-4999-b39a-e6b901ecdd8c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-usage">Usage</h2>
<p>We use vision models a ton today in production and ColiVara is informed by our use-cases. I will highlight a couple here:</p>
<p>We automate the data entry and processing of about 10,000 prescriptions/week today, working with Lilly Direct and Gifthealth pharmacy. These tasks often require reading and reasoning about what is on the screen. Sometimes, we need to save and search through all of this data. We use ColiVara to power the image workflows and search.</p>
<p>For traditional RAG, we would like to highlight <a target="_blank" href="https://onlabel.ai">OnLabel.ai</a>, something we also worked on. We often get all kinds of documents: manufacturers' drug discount coupons, large unorganized tables, charts inside PowerPoint, and many clinical trials where the main takeaways are summarized in charts and tables.</p>
<p>We built ColiVara for difficult tasks, where accuracy and recall must be high over visually rich documents.</p>
<h2 id="heading-quickstart">Quickstart</h2>
<ol>
<li><p>Get a free API Key from the <a target="_blank" href="https://colivara.com">ColiVara Website</a>.</p>
</li>
<li><p>Install the Python SDK and use it to interact with the API.</p>
<p> <code>pip install colivara-py</code></p>
</li>
<li><p>Index a document. Colivara accepts a file url, or base64 encoded file, or a file path. We support over 100 file formats including PDF, DOCX, PPTX, and more. We will also automatically take a screenshot of URLs (webpages) and index them.</p>
<pre><code class="lang-python"> <span class="hljs-keyword">from</span> colivara_py <span class="hljs-keyword">import</span> ColiVara

 client = ColiVara(
     <span class="hljs-comment"># this is the default and can be omitted</span>
     api_key=os.environ.get(<span class="hljs-string">"COLIVARA_API_KEY"</span>),
     <span class="hljs-comment"># this is the default and can be omitted</span>
     base_url=<span class="hljs-string">"https://api.colivara.com"</span>
 )

 <span class="hljs-comment"># Upload a document to the default_collection</span>
 document = client.upsert_document(
     name=<span class="hljs-string">"sample_document"</span>,
     url=<span class="hljs-string">"https://example.com/sample.pdf"</span>,
     <span class="hljs-comment"># optional - add metadata</span>
     metadata={<span class="hljs-string">"author"</span>: <span class="hljs-string">"John Doe"</span>},
     <span class="hljs-comment"># optional - specify a collection</span>
     collection_name=<span class="hljs-string">"user_1_collection"</span>, 
     <span class="hljs-comment"># optional - wait for the document to index</span>
     wait=<span class="hljs-literal">True</span>
 )
</code></pre>
</li>
<li><p>Search for a document. You can filter by collection name, collection metadata, and document metadata. You can also specify the number of results you want.</p>
<pre><code class="lang-python"> <span class="hljs-comment"># Simple search</span>
 results = client.search(<span class="hljs-string">"what is 1+1?"</span>)
 <span class="hljs-comment"># search with a specific collection</span>
 results = client.search(<span class="hljs-string">"what is 1+1?"</span>, collection_name=<span class="hljs-string">"user_1_collection"</span>)
 <span class="hljs-comment"># Search with a filter on document metadata</span>
 results = client.search(
     <span class="hljs-string">"what is 1+1?"</span>,
     query_filter={
         <span class="hljs-string">"on"</span>: <span class="hljs-string">"document"</span>,
         <span class="hljs-string">"key"</span>: <span class="hljs-string">"author"</span>,
         <span class="hljs-string">"value"</span>: <span class="hljs-string">"John Doe"</span>,
         <span class="hljs-string">"lookup"</span>: <span class="hljs-string">"key_lookup"</span>,  <span class="hljs-comment"># or contains</span>
     },
 )
 <span class="hljs-comment"># Search with a filter on collection metadata</span>
 results = client.search(
     <span class="hljs-string">"what is 1+1?"</span>,
     query_filter={
         <span class="hljs-string">"on"</span>: <span class="hljs-string">"collection"</span>,
         <span class="hljs-string">"key"</span>: [<span class="hljs-string">"tag1"</span>, <span class="hljs-string">"tag2"</span>],
         <span class="hljs-string">"lookup"</span>: <span class="hljs-string">"has_any_keys"</span>,
     },
 )
 <span class="hljs-comment"># top 3 pages with the most relevant information</span>
 print(results)
</code></pre>
</li>
</ol>
<h2 id="heading-key-features">Key Features</h2>
<ul>
<li><p><strong>State of the Art retrieval</strong>: The API is based on the ColPali paper and uses the ColQwen2 model for embeddings. It outperforms existing retrieval systems on both quality and latency.</p>
</li>
<li><p><strong>User Management</strong>: Multi-user setup with each user having their own collections and documents.</p>
</li>
<li><p><strong>Wide Format Support</strong>: Supports over 100 file formats including PDF, DOCX, PPTX, and more.</p>
</li>
<li><p><strong>Webpage Support</strong>: Automatically takes a screenshot of webpages and indexes them even if it is not a file.</p>
</li>
<li><p><strong>Collections</strong>: A user can have multiple collections. For example, a user can have a collection for research papers and another for books. Allowing for efficient retrieval and organization of documents.</p>
</li>
<li><p><strong>Documents</strong>: Each collection can have multiple documents with unlimited and user-defined metadata.</p>
</li>
<li><p><strong>Filtering</strong>: Filtering for collections and documents on arbitrary metadata fields. For example, you can filter documents by author or year. Or filter collections by type.</p>
</li>
<li><p><strong>Convention over Configuration</strong>: The API is designed to be easy to use with opinionated and optimized defaults.</p>
</li>
<li><p><strong>Modern PgVector Features</strong>: We use HalfVecs for faster search and reduced storage requirements.</p>
</li>
<li><p><strong>REST API</strong>: Easy to use REST API with Swagger documentation.</p>
</li>
<li><p><strong>Comprehensive</strong>: Full CRUD operations for documents, collections, and users.</p>
</li>
<li><p><strong>Dockerized</strong>: Easy to setup and run with Docker and Docker Compose.</p>
</li>
</ul>
<h2 id="heading-self-hosting">Self-hosting</h2>
<p>ColiVara is available for <a target="_blank" href="https://github.com/tjmlabs/ColiVara?tab=readme-ov-file#getting-started-local-setup">self-hosting</a> and we provide commercial support. It is licensed under the Functional Source License, Version 1.1, with an Apache 2.0 Future License.</p>
<p>For questions, please contact us at <a target="_blank" href="mailto:founders@tjmlabs.com">founders@tjmlabs.com</a>. We are happy to work with you to provide an agreement that meets your needs.</p>
]]></content:encoded></item></channel></rss>