<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[ColiVara: A State of the Art Retrieval API for AI workflows]]></title><description><![CDATA[ColiVara is a State of the Art Retrieval API - with a delightful developer experience. It stores, searches, and retrieves documents based on their visual embedding.]]></description><link>https://blog.colivara.com</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 22:26:41 GMT</lastBuildDate><atom:link href="https://blog.colivara.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Unlocking 70% Faster Response Times Through Token Pooling]]></title><description><![CDATA[TLDRThis post examines improvements made to ColiVara, our ColPali-based retrieval API. We focus on hybrid search and hierarchical clustering token pooling. By benchmarking these two approaches, we aim to evaluate their impact on latency and performan...]]></description><link>https://blog.colivara.com/unlocking-70-faster-response-times-through-token-pooling</link><guid isPermaLink="true">https://blog.colivara.com/unlocking-70-faster-response-times-through-token-pooling</guid><category><![CDATA[colpali]]></category><category><![CDATA[AI]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Mon, 02 Dec 2024 18:31:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733162775029/57ebceda-dc33-4529-b843-74ee91633d2b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<details><summary>TLDR</summary><div data-type="detailsContent">This post examines improvements made to ColiVara, our ColPali-based retrieval API. We focus on hybrid search and hierarchical clustering token pooling. 
By benchmarking these two approaches, we aim to evaluate their impact on latency and performance.</div></details>

<h2 id="heading-background">Background</h2>
<p>This post examines improvements made to ColiVara, our ColPali-based retrieval API. We focus on hybrid search and hierarchical clustering token pooling. By benchmarking these two approaches, we aim to evaluate their impact on latency and performance.</p>
<p>The conventional approach to handling documents for RAG or data extraction typically involves a multi-stage process: Optical Character Recognition (OCR) to extract text, Layout Recognition to understand the document structure, Figure Captioning to interpret images, Chunking to segment the text, and finally, Embedding to represent each segment in a vector space.</p>
<p>This pipeline is not only complex and computationally demanding but also prone to error propagation. Inaccuracies in any stage, for example, OCR errors or misinterpretation of visual layouts, can significantly degrade the quality of downstream retrieval and generation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164599821/f4c2710c-ca7b-4cf7-9f67-5a5059cae488.jpeg" alt class="image--center mx-auto" /></p>
<p>A more streamlined approach, as pioneered in the <a target="_blank" href="https://arxiv.org/html/2407.01449v3">ColPali</a> paper, leverages the power of vision models. Instead of complex pre-processing, this method directly embeds entire document pages as images, simplifying the retrieval process to a similarity search on these visual embeddings.</p>
<p>This approach eliminates the need for laborious pre-processing steps and potentially captures richer contextual information from the visual layout and style of the document. This innovative approach forms the foundation of our work in ColiVara.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164640697/4d30a8c2-95aa-49af-baf3-e05d5073886d.png" alt class="image--center mx-auto" /></p>
<p>This post details our research on <strong>using hierarchical clustering token pooling,</strong> building upon and extending the core ideas presented in ColPali.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A key contribution of ColiVara lies in its API-first design. We prioritize developer experience and integration into real-world applications. This architecture, however, introduces practical considerations related to network latency and data storage.</div>
</div>

<h2 id="heading-research-question">Research Question</h2>
<p>In our previous <a target="_blank" href="https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision">post</a>, we looked into whether we could speed up our inference by using different similarity calculations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733164819871/8c5ca1ce-c619-4a19-9f74-7bd7bfe4ab0b.avif" alt class="image--center mx-auto" /></p>
<p>We found that the bottleneck is directly linked to the number of embeddings for each document. The more embeddings there are, the longer the calculation takes. The cost scales as <code>O(n*m)</code>, where <code>n</code> is the number of query vectors and <code>m</code> is the number of document vectors.</p>
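<p>To make the scaling concrete, here is a small NumPy sketch (with made-up sizes) showing that the similarity matrix, and hence the work, shrinks linearly as the number of document vectors <code>m</code> goes down:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: n query vectors vs. m document vectors,
# each of dimension 128 as in ColPali-style models.
n, d = 20, 128

for m in (1038, 346):  # full page vs. pooled at factor 3
    Q = rng.standard_normal((n, d))
    D = rng.standard_normal((m, d))
    sims = Q @ D.T  # every query vector against every document vector
    print(sims.shape, sims.size)  # n*m entries: work shrinks linearly with m
```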
<p>The question we wanted to answer was: <strong>Can we maintain the same state-of-the-art performance with either fewer document candidates or a significantly lower embedding count?</strong></p>
<p>The goal was to see if we could improve latency by <strong>reducing the number of documents</strong> (using hybrid search) or the <strong>number of document vectors</strong> (using token pooling) without losing any performance.</p>
<p>Our metric, consistent with the ColPali paper, was the NDCG@5 score on <strong>ArxivQA</strong> and our typical API request latency using:</p>
<ul>
<li><p><strong>Hybrid Keyword</strong> Search with Postgres native search capabilities to reduce the number of candidates for inference</p>
</li>
<li><p><strong>Token pooling</strong> at indexing time to reduce the total number of embeddings by averaging within similarity clusters</p>
</li>
</ul>
<p>The ColPali paper uses late-interaction style computations to determine the relevancy between a query and documents. Here is a simple GitHub gist explaining how late-interaction style computations work: <a target="_blank" href="https://gist.github.com/Jonathan-Adly/cb905cc3958b973d7b3b6f25d9915c39">https://gist.github.com/Jonathan-Adly/cb905cc3958b973d7b3b6f25d9915c39</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733165277073/cfa3328b-bd03-47a6-a61d-1a102dbce351.png" alt class="image--center mx-auto" /></p>
<p>The key point is that late-interactions rely on <strong>multi-vector</strong> representation. Multi-vectors introduce significant storage and memory overhead and are computationally demanding to search through.</p>
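<p>As a quick illustration of the gist above, here is a minimal NumPy sketch of late-interaction (MaxSim) scoring, using toy 3-dimensional vectors in place of real 128-dimensional embeddings:</p>

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) relevance: for each query vector, take its
    best dot product against any document vector, then sum those maxima."""
    sims = query_vecs @ doc_vecs.T  # (n_query, n_doc) dot products
    return sims.max(axis=1).sum()   # best match per query vector, summed

# Toy vectors standing in for multi-vector embeddings
query = np.array([[1, 2, 3], [0, 1, 1]])
doc = np.array([[4, 5, 6], [7, 8, 0], [1, 1, 1]])
print(maxsim_score(query, doc))  # 32 + 11 = 43
```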
<h2 id="heading-baseline-implementation">Baseline implementation</h2>
<p>To get a realistic picture of real-life performance, we set the parameters as close to our production setup at ColiVara as possible. Embeddings were stored in Postgres with the pgvector extension. Everything ran on an AWS r6g.xlarge (4-core CPU, 32 GB RAM) and was called from our Python backend code hosted on a VPS.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Because requests travel over the network, latency is also affected by where the user is on the globe and their network conditions.</div>
</div>

<p>We use ColQwen2 as the base model and <a target="_blank" href="https://github.com/illuin-tech/colpali">colpali-engine (v0.3.4)</a> to generate embeddings. It improves upon the paper's base implementation with ~25% fewer embeddings and better performance. Our work builds on top of those improvements to enhance them further.</p>
<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://github.com/taesiri/ArXivQA"><strong>ArxivQA</strong></a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average NDCG@5 score: 0.88</p>
</li>
<li><p>Average latency: 3.58 seconds</p>
</li>
</ul>
<p>The dataset is composed of 500 pages. This score matches the leader of the <a target="_blank" href="https://huggingface.co/spaces/vidore/vidore-leaderboard">Vidore leaderboard</a> and is considered state-of-the-art retrieval.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">DCG is a measure of relevance that considers the position of relevant results in the returned list. It assigns higher scores to results that appear earlier. Normalized Discounted Cumulative Gain normalizes DCG by dividing it by the ideal DCG (IDCG) for a given query, providing a score between 0 and 1. In this project, we calculate NDCG@5 to evaluate the top 5 search results for each query.</div>
</div>
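<p>For readers who want the metric spelled out, here is a minimal sketch of NDCG@k under the simplifying assumption of binary relevance (the <code>ndcg_at_k</code> helper is illustrative, not our evaluation code):</p>

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a ranked list of relevance grades (binary 0/1 here).
    DCG discounts each result by log2 of its position; NDCG divides by the
    DCG of the ideal (best possible) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The single correct page returned at rank 1 vs. rank 3
print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```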

<h2 id="heading-hybrid-search">Hybrid search</h2>
<p>Having run a few large RAG applications before, we were well aware of the power of hybrid search. The details of the implementation can vary, but the main idea is to quickly narrow down your candidate documents using Postgres search capabilities, then re-rank them with more advanced semantic search techniques.</p>
<p>In our implementation, we used gemini-flash-8B to create captions and add keywords to each document during indexing. At query time, we used the same LLM to convert the query into keywords, then employed standard Postgres search with a GIN index to retrieve the top 25 documents.</p>
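<p>To show the shape of this two-stage pipeline, here is a toy, self-contained sketch. The names and data are hypothetical; production uses Postgres full-text search with a GIN index for stage one, not Python sets:</p>

```python
import numpy as np

# Toy corpus: each page has LLM-generated keywords plus multi-vector embeddings.
pages = {
    "p1": {"keywords": {"transformer", "attention"},
           "emb": np.array([[1.0, 0.0], [0.0, 1.0]])},
    "p2": {"keywords": {"protein", "folding"},
           "emb": np.array([[0.0, 1.0], [1.0, 1.0]])},
}

def hybrid_search(query_keywords, query_emb, top_k=1):
    # Stage 1: cheap keyword overlap narrows the candidate set.
    candidates = [p for p, d in pages.items() if d["keywords"] & query_keywords]
    # Stage 2: late-interaction (MaxSim) rerank over the survivors only.
    def maxsim(doc_emb):
        return (query_emb @ doc_emb.T).max(axis=1).sum()
    return sorted(candidates, key=lambda p: -maxsim(pages[p]["emb"]))[:top_k]

print(hybrid_search({"attention"}, np.array([[1.0, 0.0]])))  # ['p1']
```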
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">There are many tunable parameters in a hybrid search implementation. As our goal was to improve latency, we were intentional about using the fastest reasonable multimodal LLM at query time, the lowest reasonable candidate count, and out-of-the-box standard Postgres settings.</div>
</div>

<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://github.com/taesiri/ArXivQA"><strong>ArxivQA</strong></a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average latency: 2.65 seconds</p>
</li>
<li><p>Average NDCG@5 score: 0.68</p>
</li>
</ul>
<p>The good news is that we reduced latency by about 25%. However, performance decreased significantly. We believe that by tuning the parameters of the hybrid search implementation we could recover some performance, but it would involve a trade-off between latency and performance. For instance, using a larger model like Qwen2-VL 72B for keyword extraction might enhance performance, but it would also be slower. Similarly, indexing the full text of the documents instead of just captions and keywords might score better, but it would also slow things down.</p>
<h2 id="heading-hierarchical-clustering-token-pooling"><strong>Hierarchical Clustering Token</strong> Pooling</h2>
<p>The next optimization we wanted to test is <strong>hierarchical clustering token pooling</strong>. Despite the complex name, it's actually quite simple. You take your embeddings and average them in <strong>clusters of similarity</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733163945750/2784009d-3031-4f85-8419-bae8ffc8d1e4.png" alt class="image--center mx-auto" /></p>
<p>You start with a desired compression level, called the pooling factor. At a pooling factor of 1, you keep your embeddings as they are. With a factor of 2, you reduce your embeddings by half. At a pooling factor of 3, you save only a third of your original embeddings.</p>
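<p>The arithmetic is straightforward; for a 1038-vector page, the pooling factor maps to cluster counts like so:</p>

```python
# Pooling factor -> number of embeddings kept for a 1038-vector page
page_vectors = 1038
for pool_factor in (1, 2, 3):
    kept = max(page_vectors // pool_factor, 1)
    print(pool_factor, kept)  # 1 -> 1038, 2 -> 519, 3 -> 346
```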
<p>We got the idea from this excellent <a target="_blank" href="https://www.answer.ai/posts/colbert-pooling.html">post</a> by Answer.AI, which tried this approach using ColBert and found success. The optimal point seemed to be a pooling factor of 3, so we chose that.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The most common implementation of pooling is to just take everything and average it into a single vector. Like Answer.AI, we are very skeptical that this is a good approach. <strong>Not all tokens are created equal</strong>. Single-vector pooling averages important tokens together with unimportant ones into a single representation.</div>
</div>

<p>The implementation is only ~10 lines of code: a standalone function that pools your embeddings at index time, or even after the fact.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List

<span class="hljs-keyword">from</span> scipy.cluster.hierarchy <span class="hljs-keyword">import</span> fcluster, linkage
<span class="hljs-keyword">from</span> scipy.spatial.distance <span class="hljs-keyword">import</span> squareform

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">pool_embeddings</span>(<span class="hljs-params">embeddings: torch.Tensor, pool_factor: int = <span class="hljs-number">3</span></span>) -&gt; List[List[float]]:</span>
    <span class="hljs-string">"""
    Reduces number of embeddings by clustering similar ones together.

    Args:
        embeddings: Single image embeddings of shape (1038, 128)
                   Example with 4 vectors, 3 dimensions for simplicity:
                   [[1,0,1],
                    [1,0,1],
                    [0,1,0],
                    [0,1,0]]
    """</span>
    <span class="hljs-comment"># Step 1: Calculate similarity between all vectors</span>
    <span class="hljs-comment"># For our example above, this creates a 4x4 similarity matrix:</span>
    <span class="hljs-comment"># [[1.0  1.0  0.0  0.0],    # Token 1 compared to all tokens (same, same, different, different)</span>
    <span class="hljs-comment">#  [1.0  1.0  0.0  0.0],    # Token 2 compared to all tokens</span>
    <span class="hljs-comment">#  [0.0  0.0  1.0  1.0],    # Token 3 compared to all tokens</span>
    <span class="hljs-comment">#  [0.0  0.0  1.0  1.0]]    # Token 4 compared to all tokens</span>
    <span class="hljs-comment"># High values (1.0) mean tokens are very similar</span>
    similarities = torch.mm(embeddings, embeddings.t())

    <span class="hljs-comment"># Step 2: Convert to distances (1 - similarity)</span>
    <span class="hljs-comment"># For our example:</span>
    <span class="hljs-comment"># [[0.0  0.0  1.0  1.0],    # Now low values mean similar</span>
    <span class="hljs-comment">#  [0.0  0.0  1.0  1.0],    # 0.0 = identical</span>
    <span class="hljs-comment">#  [1.0  1.0  0.0  0.0],    # 1.0 = completely different</span>
    <span class="hljs-comment">#  [1.0  1.0  0.0  0.0]]</span>
    distances = <span class="hljs-number">1</span> - similarities.cpu().numpy()

    <span class="hljs-comment"># Step 3: Calculate target number of clusters</span>
    <span class="hljs-comment"># For our example with pool_factor=2:</span>
    <span class="hljs-comment"># 4 tokens → 2 clusters</span>
    target_clusters = max(embeddings.shape[<span class="hljs-number">0</span>] // pool_factor, <span class="hljs-number">1</span>)

    <span class="hljs-comment"># Step 4: Perform hierarchical clustering</span>
    <span class="hljs-comment"># This groups similar tokens together</span>
    <span class="hljs-comment"># For our example, cluster_labels would be:</span>
    <span class="hljs-comment"># [1, 1, 2, 2]  # Tokens 1&amp;2 in cluster 1, Tokens 3&amp;4 in cluster 2</span>
    <span class="hljs-comment"># linkage expects a condensed distance matrix, so flatten the square one</span>
    clusters = linkage(squareform(distances, checks=<span class="hljs-literal">False</span>), method=<span class="hljs-string">"ward"</span>)
    cluster_labels = fcluster(clusters, t=target_clusters, criterion=<span class="hljs-string">"maxclust"</span>)

    <span class="hljs-comment"># Step 5: Average embeddings within each cluster</span>
    <span class="hljs-comment"># For our example:</span>
    <span class="hljs-comment"># Cluster 1 average = [1,0,1] and [1,0,1] → [1,0,1]</span>
    <span class="hljs-comment"># Cluster 2 average = [0,1,0] and [0,1,0] → [0,1,0]</span>
    <span class="hljs-comment"># Final result: [[1,0,1], [0,1,0]]</span>
    pooled = []
    <span class="hljs-keyword">for</span> cluster_id <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, target_clusters + <span class="hljs-number">1</span>):
        mask = cluster_labels == cluster_id
        cluster_embeddings = embeddings[mask]
        cluster_mean = cluster_embeddings.mean(dim=<span class="hljs-number">0</span>)
        pooled.append(cluster_mean.tolist())

    <span class="hljs-keyword">return</span> pooled
</code></pre>
<p>This method of pooling is elegant in that nothing else changes except the number of embeddings being generated.</p>
<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://github.com/taesiri/ArXivQA"><strong>ArxivQA</strong></a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average latency: 2.13 seconds</p>
</li>
<li><p>Average NDCG@5 score: 0.87</p>
</li>
</ul>
<p>This was as close to a free lunch as you will ever get. Storage cost went down by 66%, latency improved by about 40%, and there was very little performance loss. It was the kind of optimization that is usually only theoretical.</p>
<p>We decided to implement it in production and ran the entire evaluation suite. The final results were even more impressive, with up to 70% better latency on larger document collections and very minimal loss. You can see the full <a target="_blank" href="https://github.com/tjmlabs/colivara-eval">results</a> here.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We tested and experimented with hybrid search and token pooling to improve our API's latency without sacrificing performance. Hybrid search improved latency by narrowing down results with keywords, but it reduced performance. On the other hand, hierarchical clustering token pooling greatly improved storage efficiency and latency with minimal performance loss.</p>
]]></content:encoded></item><item><title><![CDATA[From Cosine to Dot: Benchmarking Similarity Methods for Speed and Precision]]></title><description><![CDATA[Background
Retrieval Augmented Generation (RAG) empowers large language models (LLMs) by integrating private documents and proprietary knowledge, unlocking their potential for nuanced and informed responses. However, efficiently extracting informatio...]]></description><link>https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision</link><guid isPermaLink="true">https://blog.colivara.com/from-cosine-to-dot-benchmarking-similarity-methods-for-speed-and-precision</guid><category><![CDATA[AI]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[search]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Wed, 20 Nov 2024 15:25:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732044793701/e7329077-d762-4c80-a839-5f16aa932f76.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>Retrieval Augmented Generation (RAG) empowers large language models (LLMs) by integrating private documents and proprietary knowledge, unlocking their potential for nuanced and informed responses. However, efficiently extracting information from unstructured documents—especially those with complex visual layouts—remains a significant challenge.</p>
<p>The conventional approach to handling visually dense documents typically involves a multi-stage process: Optical Character Recognition (OCR) to extract text, Layout Recognition to understand the document structure, Figure Captioning to interpret images, Chunking to segment the text, and finally, Embedding to represent each segment in a vector space. This pipeline is not only complex and computationally demanding but also prone to error propagation. Inaccuracies in any stage, for example, OCR errors or misinterpretation of visual layouts, can significantly degrade the quality of downstream retrieval and generation.</p>
<p>A more streamlined approach, as pioneered in the <a target="_blank" href="https://arxiv.org/html/2407.01449v3">ColPali</a> paper, leverages the power of vision models. Instead of complex pre-processing, this method directly embeds entire document pages as images, simplifying the retrieval process to a similarity search on these visual embeddings. This approach eliminates the need for laborious pre-processing steps and potentially captures richer contextual information from the visual layout and style of the document. This innovative approach forms the foundation of our work in ColiVara.</p>
<p>This post details our research on <strong>optimizing and benchmarking various similarity calculation methods,</strong> building upon and extending the core ideas presented in ColPali.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A key contribution of ColiVara lies in its API-first design. We prioritize developer experience and integration into real-world applications. This architecture, however, introduces practical considerations related to network latency and data storage.</div>
</div>

<p>We store vector embeddings in a PostgreSQL database, employing a one-to-many relationship between vectors and their corresponding pages.</p>
<h2 id="heading-research-question">Research Question</h2>
<p>The ColPali paper uses late-interaction-style computations to calculate the relevancy between a query and documents. The question we wanted to answer: <strong>could we get the same results using other similarity computations with better latency?</strong></p>
<p>Our metric, to stay consistent with the paper, was the NDCG@5 score on DocVQA and <strong>our</strong> typical API request latency using:</p>
<ul>
<li><p>Late-interaction Cosine Similarity</p>
</li>
<li><p>Hamming distance with Binary Quantization of vectors</p>
</li>
<li><p>Hamming distance as above with late-interaction re-ranking</p>
</li>
</ul>
<p>The <strong>core idea in simple terms</strong> in the ColPali paper implementation of late-interaction is:</p>
<ul>
<li><p>We have <strong>query embeddings</strong>: an array with <code>n</code> vectors, each of size 128 (floats).</p>
</li>
<li><p>We have <strong>document embeddings</strong>: an array with 1038 vectors, each of size 128 (floats).</p>
</li>
<li><p>For <strong>each query vector</strong> (n), we find the <strong>most similar document vector</strong> by computing the <strong>dot product</strong>. Here is simple code for straight-forward dot product similarity.</p>
<pre><code class="lang-python">  <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

  <span class="hljs-comment"># Define the query vector</span>
  query = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

  <span class="hljs-comment"># Define the document vectors</span>
  document_a = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])
  document_b = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">0</span>])

  <span class="hljs-comment"># Compute the dot product between the query and each document</span>
  dot_product_a = np.dot(query, document_a)  <span class="hljs-comment"># 1*4 + 2*5 + 3*6 = 32</span>
  dot_product_b = np.dot(query, document_b)  <span class="hljs-comment"># 1*7 + 2*8 + 3*0 = 23</span>

  <span class="hljs-comment"># Output the results</span>
  print(<span class="hljs-string">f"Dot product between query and document a: <span class="hljs-subst">{dot_product_a}</span>"</span>)
  print(<span class="hljs-string">f"Dot product between query and document b: <span class="hljs-subst">{dot_product_b}</span>"</span>)
  <span class="hljs-comment"># document a is more similar than document b to our query</span>
</code></pre>
</li>
<li><p><strong>Late interaction</strong> works like this: take each <strong>query vector</strong>, find which <strong>document vector</strong> it matches best with (using dot product), and add up all these best matches. Here's a simple example that shows exactly how this works:</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># One query generates multiple vectors (each vector is 128 floats)</span>
query_vector_1 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
query_vector_2 = np.array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
query = [query_vector_1, query_vector_2]  <span class="hljs-comment"># one query -&gt; n vectors</span>

<span class="hljs-comment"># One document generates multiple vectors (each vector is 128 floats)</span>
document_vector_1 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
document_vector_2 = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">0</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
document_vector_3 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>])  <span class="hljs-comment"># simplified from 128 floats</span>
document = [document_vector_1, document_vector_2, document_vector_3]  <span class="hljs-comment"># one document -&gt; 1038 vectors</span>

<span class="hljs-comment"># For each vector in our query, find its maximum dot product with ANY vector in the document</span>
<span class="hljs-comment"># Then sum these maximums</span>

<span class="hljs-comment"># First query vector against ALL document vectors</span>
dot_products_vector1 = [
    np.dot(query_vector_1, document_vector_1),  <span class="hljs-comment"># 1*4 + 2*5 + 3*6 = 32</span>
    np.dot(query_vector_1, document_vector_2),  <span class="hljs-comment"># 1*7 + 2*8 + 3*0 = 23</span>
    np.dot(query_vector_1, document_vector_3)   <span class="hljs-comment"># 1*1 + 2*1 + 3*1 = 6</span>
]
max_similarity_vector1 = max(dot_products_vector1)  <span class="hljs-comment"># 32</span>

<span class="hljs-comment"># Second query vector against ALL document vectors</span>
dot_products_vector2 = [
    np.dot(query_vector_2, document_vector_1),  <span class="hljs-comment"># 0*4 + 1*5 + 1*6 = 11</span>
    np.dot(query_vector_2, document_vector_2),  <span class="hljs-comment"># 0*7 + 1*8 + 1*0 = 8</span>
    np.dot(query_vector_2, document_vector_3)   <span class="hljs-comment"># 0*1 + 1*1 + 1*1 = 2</span>
]
max_similarity_vector2 = max(dot_products_vector2)  <span class="hljs-comment"># 11</span>

<span class="hljs-comment"># Final similarity score is the sum of maximum similarities</span>
final_score = max_similarity_vector1 + max_similarity_vector2  <span class="hljs-comment"># 32 + 11 = 43</span>

print(<span class="hljs-string">f"Final similarity score: <span class="hljs-subst">{final_score}</span>"</span>)
</code></pre>
<p>Now - as you can imagine, this is computationally heavy at scale (or so we thought!). So, we looked for ways to see if we can make it faster and maybe even better.</p>
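<p>A rough, machine-dependent sketch of that cost, scoring one query against a 500-page collection using random stand-in embeddings:</p>

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# One query expands to 20 vectors of dimension 128 (sizes are illustrative);
# each page contributes 1038 vectors, so one search over 500 pages performs
# roughly 20 * 1038 * 500 = ~10 million 128-dim dot products.
query = rng.standard_normal((20, 128)).astype(np.float32)

def score_page(page):
    return (query @ page.T).max(axis=1).sum()  # MaxSim for one page

start = time.perf_counter()
scores = [score_page(rng.standard_normal((1038, 128)).astype(np.float32))
          for _ in range(500)]
elapsed = time.perf_counter() - start

print(len(scores), f"{elapsed:.2f}s")
```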
<h2 id="heading-baseline-implementation">Baseline implementation</h2>
<p>To get a realistic picture of real-life performance, we set the parameters as close to our production setup at ColiVara as possible. Embeddings were stored in Postgres with the pgvector extension. Everything ran on an AWS r6g.xlarge (4-core CPU, 32 GB RAM) and was called from our Python backend code hosted on a VPS.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Because requests travel over the network, latency is also affected by where the user is on the globe and their network conditions.</div>
</div>

<p>We implemented the paper's late-interaction calculation as a Postgres function:</p>
<pre><code class="lang-sql"> <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">replace</span> <span class="hljs-keyword">FUNCTION</span> max_sim(<span class="hljs-keyword">document</span> halfvec[],
                                   <span class="hljs-keyword">query</span> halfvec[]) <span class="hljs-keyword">returns</span> <span class="hljs-keyword">DOUBLE</span> <span class="hljs-keyword">PRECISION</span>
<span class="hljs-keyword">AS</span>
  $$ <span class="hljs-keyword">WITH</span> queries <span class="hljs-keyword">AS</span>
  (
         <span class="hljs-keyword">SELECT</span> row_number() <span class="hljs-keyword">OVER</span> () <span class="hljs-keyword">AS</span> query_number,
                *
         <span class="hljs-keyword">FROM</span>   (
                       <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">unnest</span>(<span class="hljs-keyword">query</span>) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">query</span>) ), documents <span class="hljs-keyword">AS</span>
  (
         <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">unnest</span>(<span class="hljs-keyword">document</span>) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">document</span> ), similarities <span class="hljs-keyword">AS</span>
  (
             <span class="hljs-keyword">SELECT</span>     query_number,
                        (<span class="hljs-keyword">document</span> &lt;<span class="hljs-comment">#&gt; query) * -1 AS similarity</span>
             <span class="hljs-keyword">FROM</span>       queries
             <span class="hljs-keyword">CROSS</span> <span class="hljs-keyword">JOIN</span> documents ), max_similarities <span class="hljs-keyword">AS</span>
  (
           <span class="hljs-keyword">SELECT</span>   <span class="hljs-keyword">max</span>(similarity) <span class="hljs-keyword">AS</span> max_similarity
           <span class="hljs-keyword">FROM</span>     similarities
           <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> query_number )<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">Sum</span>(max_similarity)
  <span class="hljs-keyword">FROM</span>   max_similarities;
</code></pre>
<p>The magic line here is <code>(document &lt;#&gt; query) * -1</code>. The <code>&lt;#&gt;</code> operator comes from <a target="_blank" href="https://github.com/pgvector/pgvector">pgVector</a>, and we multiply by -1 because it returns the negative inner product by default. We also use <a target="_blank" href="https://blog.colivara.com/optimizing-vector-storage-with-halfvecs">halfvecs to store embeddings</a> for efficiency.</p>
<p><strong>Results:</strong></p>
<p>On <a target="_blank" href="https://www.docvqa.org/">DocVQA</a>, measuring NDCG@5 and end-to-end latency, we had the following:</p>
<ul>
<li><p>Average NDCG@5 score: 0.55</p>
</li>
<li><p>Average latency: 3.1 seconds</p>
</li>
</ul>
<p>The dataset is composed of 500 pages.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">DCG is a measure of relevance that considers the position of relevant results in the returned list. It assigns higher scores to results that appear earlier. Normalized Discounted Cumulative Gain normalizes DCG by dividing it by the ideal DCG (IDCG) for a given query, providing a score between 0 and 1. In this project, we calculate NDCG@5 to evaluate the top 5 search results for each query.</div>
</div>

<h2 id="heading-late-interaction-cosine-similarity">Late-Interaction Cosine Similarity</h2>
<p>Having run a few large RAG applications before, we were intimately familiar with how <strong>cosine similarity</strong> works. Cosine similarity normalizes for vector length, which gives a few nice properties:</p>
<ul>
<li><p>Values are always between -1 and 1</p>
</li>
<li><p>This means vectors pointing in the same direction but with different magnitudes will be considered similar</p>
</li>
<li><p>It is computationally more expensive than the dot product, as it needs to calculate vector magnitudes (requiring square roots), but it can also perform better</p>
</li>
</ul>
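<p>The difference is easy to see in a few lines (a sketch with NumPy and toy vectors, not real embeddings):</p>

```python
import numpy as np

def dot_similarity(u, v):
    # Sensitive to magnitude.
    return float(np.dot(u, v))

def cosine_similarity(u, v):
    # Normalizes by magnitude; only direction matters.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, 10x the magnitude

print(dot_similarity(a, b))              # 140.0 (grows with magnitude)
print(round(cosine_similarity(a, b), 6))  # 1.0 (identical direction)
```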
<p>Our first attempt was meant to establish what would happen if we kept everything the same, but substituted dot product with cosine similarity.</p>
<p>So, we modified our function by replacing the <strong>dot product</strong> with <strong>cosine similarity</strong>.</p>
<p>Before:</p>
<pre><code class="lang-sql">(document &lt;#&gt; query) * -1 AS similarity
</code></pre>
<p>After:</p>
<pre><code class="lang-sql">1 - (document &lt;=&gt; query) AS similarity
</code></pre>
<p>Then - we ran our evaluations again.</p>
<p><strong>Results:</strong></p>
<ul>
<li><p>Average NDCG@5 score: 0.55</p>
</li>
<li><p>Average latency: 3.25 seconds</p>
</li>
</ul>
<p>So - our latency went up, but our scores didn’t really improve. Still, it was good to see that the main latency driver is not the math behind the calculation. We could switch to cosine similarity and would likely be fine; either way, the differences in both latency and score are marginal.</p>
<h2 id="heading-binary-quantization-and-hamming-distance">Binary Quantization and Hamming Distance</h2>
<p>Binary quantization is a compression technique for vector databases. It converts 32-bit floating point numbers into binary values (1 bit), achieving significant memory savings. The conversion process typically preserves the relative relationships between vectors while sacrificing some precision. For example:</p>
<pre><code class="lang-python">embeddings = [<span class="hljs-number">0.123</span>, <span class="hljs-number">-0.456</span>, <span class="hljs-number">0.789</span>, <span class="hljs-number">-0.012</span>]
quantized_embeddings = [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]
</code></pre>
<p>To measure similarity when we have bits as our embeddings, we use the <a target="_blank" href="https://en.wikipedia.org/wiki/Hamming_distance">hamming distance</a>. Hamming distance is a measure of the number of differing bits between two binary strings of equal length. It's calculated by comparing the corresponding bits of the two strings and counting the positions at which the bits differ.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">For example, the Hamming distance between '1010' and '1100' is 2 because there are two positions where the bits differ.</div>
</div>
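<p>Putting the two together, a minimal sketch of binary quantization plus Hamming distance in plain Python, mirroring the examples above:</p>

```python
def binary_quantize(embedding):
    # Positive values become 1; zero or negative values become 0.
    return [1 if x > 0 else 0 for x in embedding]

def hamming_distance(a, b):
    # Count the bit positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

print(binary_quantize([0.123, -0.456, 0.789, -0.012]))  # [1, 0, 1, 0]
print(hamming_distance([1, 0, 1, 0], [1, 1, 0, 0]))     # 2
```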

<p>First - we converted all our embeddings to binary and stored them in their own column. Then - we changed our similarity function to what the <a target="_blank" href="https://github.com/pgvector/pgvector-python/blob/master/examples/colpali/exact.py">pgVector-python</a> examples recommend.</p>
<pre><code class="lang-sql">1 - ((document &lt;~&gt; query) / bit_length(query))
</code></pre>
<p>Finally - we ran our evaluations.</p>
<p><strong>Results:</strong></p>
<ul>
<li><p>Average NDCG@5 score: 0.54</p>
</li>
<li><p>Average latency: 3.25 seconds</p>
</li>
</ul>
<p>The good news is that the precision loss is minimal. We are hesitant to embrace it fully even with these results, as the loss can be unpredictable. This is essentially a measure of bit diversity, which partly depends on the data being embedded—so your mileage may vary.</p>
<p>The interesting tidbit here is that latency did not improve: it is as slow as cosine similarity. We can almost conclude that if we want to make the process faster, the choice of similarity calculation is probably irrelevant.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The main benefit of Binary Quantization is storage cost. It significantly cuts down how much space your embeddings take. At larger datasets, the precision loss is probably worth the savings.</div>
</div>

<h2 id="heading-binary-quantization-with-re-ranking-w-100-documents">Binary Quantization with Re-ranking w/ 100 documents</h2>
<p>The final evaluation we wanted to conduct before wrapping up this experiment was to see if we could use Hamming distance and then re-rank using dot product late interactions. If our theory is correct—that the type of similarity calculation doesn't matter much—we should expect a slight increase in latency (since we're doing two calculations) and no significant improvement in performance.</p>
<p>That's exactly what happened.</p>
<p><strong>Results:</strong></p>
<ul>
<li><p>Average NDCG@5 score: 0.55</p>
</li>
<li><p>Average latency: 3.5 seconds</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732044784431/81983c70-c737-4a0a-abc3-a2697e1b61f0.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-bottleneck">The bottleneck</h2>
<p>Postgres is highly optimized, so adding a square root or doing a bit of extra multiplication has minimal impact. It's an <strong>obvious realization</strong> in hindsight, but without conducting these evaluations and this experiment, we wouldn't have been able to prove it. The real bottleneck is here:</p>
<pre><code class="lang-sql">FROM queries CROSS JOIN documents
</code></pre>
<p>The CROSS JOIN is <strong>the major bottleneck</strong>! Let's break down the scale:</p>
<p>For each comparison:</p>
<ul>
<li><p>Query: n vectors</p>
</li>
<li><p>Document: 1038 vectors per page</p>
</li>
<li><p>Corpus: 500 pages</p>
</li>
</ul>
<p>So for <strong>ONE</strong> query against <strong>ONE</strong> page we're doing: <code>n * 1038 vector comparisons (CROSS JOIN)</code></p>
<p>And for the full 500 pages corpus: <code>n * 1038 * 500 vector comparisons</code></p>
<p>This creates a massive cartesian product. For example, if n=10 (for query):</p>
<ul>
<li><p>10,380 comparisons per page</p>
</li>
<li><p>5,190,000 comparisons for 500 pages</p>
</li>
</ul>
<p>The CROSS JOIN creates this multiplicative <code>O(n*m)</code> complexity, where:</p>
<ul>
<li><p>n = number of query vectors</p>
</li>
<li><p>m = number of document vectors</p>
</li>
</ul>
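<p>The arithmetic is worth writing out (a back-of-the-envelope sketch using the numbers above):</p>

```python
n = 10        # query vectors
m = 1038      # document vectors per page
pages = 500   # corpus size

per_page = n * m          # comparisons for one query against one page
total = per_page * pages  # comparisons against the whole corpus

print(per_page)  # 10380
print(total)     # 5190000
```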
<p>So - the math behind the comparisons is <strong>relatively</strong> inconsequential. The real optimization has to happen before we reach this point.</p>
<h2 id="heading-what-can-be-done">What can be done?</h2>
<p>The crucial optimization insight we got out of this experiment is to hyper-focus on the number of pages/documents we run through our similarity function. Nothing else really matters.</p>
<p>If, instead of <code>CROSS JOIN documents (1038 vectors * 500 pages)</code>, we first filter:</p>
<pre><code class="lang-sql">CROSS JOIN (
    <span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> documents 
    <span class="hljs-keyword">WHERE</span> page_id <span class="hljs-keyword">IN</span> (
        <span class="hljs-comment">-- something to pre-filter to maybe 10-20 relevant pages</span>
        <span class="hljs-comment">-- instead of all 500</span>
    )
)
</code></pre>
<p>All of a sudden, we gain massive performance. The insight here, for this specific problem: <code>1038 * 10</code> <em>pages is <strong>so much better</strong> than</em> <code>1038 * 500</code> <em>pages.</em></p>
<p>It's like finding a book in a library:</p>
<ul>
<li><p>First narrow to the right section (cheap and quick search)</p>
</li>
<li><p>Then look for the exact content (expensive vector similarity)</p>
</li>
</ul>
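<p>The same two-stage idea can be sketched in code. This is a toy example with made-up page sizes and a trivial keyword pre-filter; a real pre-filter would use metadata or full-text search:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 50 pages, each with 20 token vectors of dimension 8
# (real ColPali pages have ~1038 vectors of dimension 128).
pages = {pid: rng.normal(size=(20, 8)) for pid in range(50)}
page_text = {pid: f"page about topic {pid % 5}" for pid in range(50)}
query = rng.normal(size=(4, 8))

def maxsim(query_vecs, doc_vecs):
    # Late interaction: best match per query vector, summed.
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

# Stage 1 (cheap): narrow the corpus with a trivial text match.
candidates = [pid for pid in pages if "topic 3" in page_text[pid]]

# Stage 2 (expensive): run MaxSim only on the survivors.
ranked = sorted(candidates, key=lambda p: maxsim(query, pages[p]), reverse=True)
print(len(candidates), "pages scored instead of", len(pages))  # 10 instead of 50
```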
<p>We currently use and recommend <a target="_blank" href="https://docs.colivara.com/guide/filtering">advanced filters</a> in ColiVara to reduce the document set. When a user uploads a document, they can attach arbitrary JSON metadata, as complex or simple as needed. At query time, they can use filtering to reduce the corpus size. This is a powerful technique, but it can be complex to set up.</p>
<h2 id="heading-future-work">Future work</h2>
<p>There are a couple of approaches we are planning to tackle to reduce the number of documents without any effort on the user's side. The first is a set of experiments using Postgres full-text search. This takes us away from vision models, but could be a straightforward win.</p>
<p>The second is to use vision-LLM-generated metadata to store semantic information about documents automatically at indexing time. We can then use the existing filtering to reduce the corpus size via that metadata.</p>
<p>The main problem vision models solve, in our experience, is that traditional OCR pipelines can’t capture the <strong>full</strong> visual cues in documents. Here, we don’t need to capture everything, just enough to narrow the search; the vision models will do the rest.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We benchmarked and experimented with several similarity calculation setups for our RAG API using vision models. The differences are marginal and don’t meaningfully affect performance or latency.</p>
<p>Due to the large size of embeddings produced, the best performance optimization is to reduce the number of documents that go through the similarity calculation, rather than optimize that similarity calculation itself.</p>
]]></content:encoded></item><item><title><![CDATA[Optimizing Vector Storage with halfvecs]]></title><description><![CDATA[RAG (Retrieval Augmented Generation) is a powerful technique that allows us to enhance large language models (LLMs) output with private documents and proprietary knowledge that is not available elsewhere. For example, a company's internal documents o...]]></description><link>https://blog.colivara.com/optimizing-vector-storage-with-halfvecs</link><guid isPermaLink="true">https://blog.colivara.com/optimizing-vector-storage-with-halfvecs</guid><category><![CDATA[AI]]></category><category><![CDATA[pgvector]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Thu, 14 Nov 2024 15:47:32 GMT</pubDate><content:encoded><![CDATA[<p>RAG (Retrieval Augmented Generation) is a powerful technique that allows us to enhance large language models (LLMs) output with private documents and proprietary knowledge that is not available elsewhere. For example, a company's internal documents or a researcher's notes.</p>
<p>There are many ways to give relevant context to LLMs in a RAG system. We can use a simple keyword search in the database or more advanced search algorithms like BM25, which go beyond keyword matching. Here is an example of a simple keyword search.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *
<span class="hljs-keyword">FROM</span> articles
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">content</span> <span class="hljs-keyword">LIKE</span> <span class="hljs-string">'%keyword%'</span>;
</code></pre>
<p>A step further, we can use pretrained language models to create embeddings that provide a lot of information through high dimensionality. Here is a simple example using the excellent SentenceTransformers library.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer

<span class="hljs-comment"># 1. Load model</span>
model = SentenceTransformer(<span class="hljs-string">"all-MiniLM-L6-v2"</span>)

<span class="hljs-comment"># 2. Our documents and query</span>
documents = [
    <span class="hljs-string">"Python is great for programming"</span>,
    <span class="hljs-string">"I have a pet dog"</span>,
    <span class="hljs-string">"The weather is sunny today"</span>
]
query = <span class="hljs-string">"How to program in Python?"</span>

<span class="hljs-comment"># 3. Calculate embeddings</span>
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

<span class="hljs-comment"># 4. Find similarities</span>
similarities = model.similarity(query_embedding, doc_embeddings)
print(<span class="hljs-string">f"Most similar document: <span class="hljs-subst">{documents[similarities.argmax()]}</span>"</span>)
<span class="hljs-comment"># Python is great for programming</span>
</code></pre>
<p>In recent years, pretrained language models have greatly improved text embedding models. However, in our experience, the main challenge for efficient document retrieval is not the performance of the embedding model but the earlier data ingestion process.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The goal of RAG is to provide LLMs with relevant context to enhance their performance. This is the essence of RAG, and it can be tailored to be as simple or complex as the specific use-case demands.</div>
</div>

<p>The process of turning documents into passages of text via OCR, chunking, and a complex pipeline of data cleaning is fragile and error-prone. For one of our projects doing RAG over clinical trials, we lost over 30% of the context during this process.</p>
<h2 id="heading-colpali">ColPali</h2>
<p>One advanced technique to improve this process is a retrieval model architecture called <a target="_blank" href="https://arxiv.org/abs/2407.01449">ColPali</a>. It uses the document understanding abilities of recent <strong>Vision</strong> Language Models to create embeddings directly from images of document pages. ColPali significantly outperforms modern document retrieval pipelines while being much faster.</p>
<p>One of the trade-offs of this new retrieval method is that while "late interaction" allows for more detailed matching between specific parts of the query and the potential context, it requires more computing power than simple vector comparisons and produces up to 100 times more embeddings per page.</p>
<p>These trade-offs are often worthwhile in highly visual documents and situations where accuracy is crucial.</p>
<p>Here, we highlight one of our many optimizations in ColiVara, where we leveraged halfvecs as our preferred method of <strong>scalar quantization.</strong></p>
<h2 id="heading-colivara">ColiVara</h2>
<p><a target="_blank" href="https://colivara.com">ColiVara</a> is a state of the art retrieval API that stores, searches, and retrieves documents based on their visual embedding. End to end it uses vision models instead of chunking and text-processing for documents.</p>
<p>In simple terms, we ask the AI models to "see" and reason, rather than "read" and reason. From the user's perspective, it functions like retrieval augmented generation (RAG) but uses vision models instead of chunking and text-processing for documents.</p>
<p>It is a web-first implementation of the <a target="_blank" href="https://arxiv.org/abs/2407.01449">ColPali: Efficient Document Retrieval with Vision Language Models</a> paper.</p>
<p>Like many AI/ML RAG systems, we create and store vectors when we save a user’s document. Since we use ColPali under the hood, each page generates embeddings that look like this.</p>
<pre><code class="lang-python"><span class="hljs-comment"># List of 1030 members, each a list of 128 floats per page</span>
embeddings = [[<span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, ..., <span class="hljs-number">0.128</span>], [<span class="hljs-number">0.1</span>, <span class="hljs-number">0.2</span>, ...]]
</code></pre>
<p>Let's calculate the storage requirements for this:</p>
<ol>
<li><p>Each float is 4 bytes.</p>
</li>
<li><p>Each embedding has 128 dimensions, so: 128 * 4 bytes = 512 bytes per embedding.</p>
</li>
<li><p>Total embeddings: 1030.</p>
</li>
<li><p>Total storage: 1030 * 512 bytes = 527,360 bytes ≈ 515 KB per page.</p>
</li>
</ol>
<p>If we have a 100-page document and a collection of 100 documents, then:</p>
<ol>
<li><p>515 KB * 100 pages = 51.5 MB per document.</p>
</li>
<li><p>51.5 MB * 100 documents = 5.15 GB per collection.</p>
</li>
</ol>
<p>This calculation is just for the raw numerical data. Actual memory usage in Python might be slightly higher due to Python's object overhead and list structure. ~5 GB per collection is manageable, but not exactly lightweight. So, we explored different quantization methods to better manage our resource usage.</p>
<h2 id="heading-quantization">Quantization</h2>
<p>There are three common quantization techniques around vector databases:</p>
<ul>
<li><p><strong>Scalar quantization</strong>, which reduces the overall size of the dimensions to a smaller data type (e.g. a 4-byte float to a 2-byte float or 1-byte integer).</p>
</li>
<li><p><strong>Binary quantization</strong>, which is a subset of scalar quantization that reduces the dimensions to a single bit (e.g. <code>&gt; 0</code> to <code>1</code>, <code>&lt;=0</code> to <code>0</code>).</p>
</li>
<li><p><strong>Product quantization</strong>, which uses a clustering technique to effectively remap the original vector to a vector with smaller dimensionality and indexes that (e.g. reduce a vector from 128-dim to 8-dim).</p>
</li>
</ul>
<p>Scalar quantization is often the easiest way to reduce vector index storage. It involves converting dimensions to a smaller data type, like changing a 4-byte float to a 2-byte float.</p>
<p>In many cases, using a 2-byte float makes sense because, during distance operations, the most important differences between two dimensions are in the more significant bits. By slightly reducing the information to focus on those bits, we shouldn't notice much difference in recall.</p>
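<p>The storage arithmetic from earlier makes the payoff concrete. A quick sketch (binary-prefix rounding differs slightly from the decimal figures above):</p>

```python
DIMS = 128                 # dimensions per embedding
EMBEDDINGS_PER_PAGE = 1030 # ColPali embeddings per page
PAGES = 100 * 100          # 100 documents x 100 pages

def storage_bytes(bytes_per_dim):
    return PAGES * EMBEDDINGS_PER_PAGE * DIMS * bytes_per_dim

full = storage_bytes(4)  # 32-bit floats
half = storage_bytes(2)  # halfvec: 16-bit floats

print(full)         # 5273600000 bytes (~5 GB)
print(half / full)  # 0.5 -> storage cut in half
```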
<p>In addition, ColPali's original implementation used Bfloat16, so the extra bits we would store by converting to 4-byte floats are imprecise anyway.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">It's worth noting that Bfloat16 is not the same as Float16 (IEEE half-precision), even though both are 16-bit formats.</div>
</div>

<p>You very rarely get a free lunch with quantization, but here we are: in this particular instance, it looks like we really do.</p>
<h2 id="heading-pgvector-performance">pgVector performance</h2>
<p>Jonathan Katz, the pgVector maintainer, has benchmarked and evaluated halfvecs in an excellent <a target="_blank" href="https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/">post</a>, which we highly recommend. In summary, you get near-identical performance between halfvecs and full vectors, but you cut your storage in half and gain slight speedups.</p>
<p>This was proof enough for us on the savings. But, Late Interactions embeddings are really a different beast than normal embeddings. So, we needed to validate performance.</p>
<p>We ran the ArxivQ portion of the Vidore benchmark, and our score was 86.6, matching state-of-the-art results on the Vidore leaderboard at the time we ran it. This made us comfortable that there is no significant performance cost to using halfvecs, so we proceeded.</p>
<h2 id="heading-future-work">Future work</h2>
<p>Optimizing vector storage with halfvecs is a first step toward making the ColPali architecture viable and cost-effective. We plan to explore a few more optimizations in the future, specifically around latency and the use of re-rankers.</p>
<p>The ColPali architecture uses MaxSim to calculate relevancy. At larger corpus sizes, the MaxSim calculation becomes a significant overhead with less-than-ideal latency.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">MaxSim (Maximum Similarity) is a method for measuring relevance between a query and a document by finding the maximum similarity score between query terms and document terms.</div>
</div>

<p>Most “traditional” RAG architectures use cosine similarity as a first-step relevance calculation. So, in a sense, this is our baseline. MaxSim is more computationally intense than cosine similarity because it compares each query term with every document term.</p>
<p>While cosine similarity does just one vector comparison, MaxSim does many:</p>
<ul>
<li>If there are <code>n</code> terms in the query and <code>m</code> in the document, MaxSim needs <code>n × m</code> cosine-similarity-like calculations, making it much slower.</li>
</ul>
<p>So, MaxSim could be <strong>100 to 5,000 times more costly</strong> than cosine similarity, depending on the number of terms.</p>
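<p>A sketch of the difference in work (NumPy, toy sizes, with MaxSim written as a single matrix multiply):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, dim = 20, 1030, 128        # query terms, document terms, dimensions
Q = rng.normal(size=(n, dim))    # multi-vector query (late interaction)
q = rng.normal(size=(dim,))      # single-vector query (traditional RAG)
D = rng.normal(size=(m, dim))    # document token embeddings
d = D.mean(axis=0)               # single-vector document stand-in

# Cosine similarity: one vector-vector comparison per document.
cos = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# MaxSim: n x m comparisons -- every query term vs every document term.
maxsim_score = float((Q @ D.T).max(axis=1).sum())

print(1, "comparison vs", n * m, "comparisons")  # 1 vs 20600
```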
<p>We believe the way to solve this is via re-rankers. In practice, we would run a fast search to narrow down the number of documents, then run MaxSim on those. Instead of running MaxSim on 1,000 documents, we would run it on only 10.</p>
<p>Our next step is an automated evaluation pipeline, so we can accurately identify and optimize this process. We believe a combination of native Postgres vector search followed by MaxSim is probably the best balance, but we want a good foundation of automated evaluations first.</p>
<h2 id="heading-binary-quantization">Binary <strong>Quantization</strong></h2>
<p>Binary quantization is a more extreme technique that reduces the full value of a vector's dimension to just a single bit of information. Specifically, it converts any positive value to <code>1</code> and any zero or negative value to <code>0</code>.</p>
<p>For further storage optimizations, we ran a few quick experiments with Binary Quantization and concluded that the performance penalty is hard to predict, since bit diversity is not easily measured.</p>
<p>Bit diversity depends on the embedding model, its size, and the data being embedded. Our eval data and our customers' data could look very different, so it is difficult to measure the effects.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">One common technique with Binary Quantization is to use Hamming distance scores to measure similarity. Hamming scores count the number of bit positions that differ between two binary strings, providing a simple similarity metric for binary data where a score of 0 indicates identical strings.</div>
</div>

<p>We could explore future pipelines where we compute Hamming distance scores first, then MaxSim. However, this would increase storage requirements, since we would need to store both halfvecs and binary bits, and it could be less predictable than standard Postgres vector search.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We recommend using halfvecs as the starting point for efficient vector storage. The performance loss is minimal, and the storage savings are substantial. In ColiVara, which is built on top of pgVector and Postgres, we experienced no performance loss and achieved a 50% reduction in storage usage.</p>
]]></content:encoded></item><item><title><![CDATA[ColiVara: a state of the art retrieval API with a delightful developer experience.]]></title><description><![CDATA[We are launching ColiVara today!
ColiVara is a state of the art retrieval API that stores, searches, and retrieves documents based on their visual embedding. End to end it uses vision models instead of chunking and text-processing for documents.
In s...]]></description><link>https://blog.colivara.com/colivara-a-state-of-the-art-retrieval-api-with-a-delightful-developer-experience</link><guid isPermaLink="true">https://blog.colivara.com/colivara-a-state-of-the-art-retrieval-api-with-a-delightful-developer-experience</guid><category><![CDATA[RAG ]]></category><category><![CDATA[AI]]></category><category><![CDATA[#ai-tools]]></category><dc:creator><![CDATA[Jonathan Adly]]></dc:creator><pubDate>Mon, 11 Nov 2024 16:47:16 GMT</pubDate><content:encoded><![CDATA[<p>We are launching ColiVara today!</p>
<p>ColiVara is a state of the art retrieval API that stores, searches, and retrieves documents based on their visual embedding. End to end it uses vision models instead of chunking and text-processing for documents.</p>
<p>In simple terms, we ask the AI models to "see" and reason, rather than "read" and reason. From the user's perspective, it functions like retrieval augmented generation (RAG) but uses vision models instead of chunking and text-processing for documents.</p>
<p>You can read more about the project on <a target="_blank" href="https://github.com/tjmlabs/ColiVara">Github</a> or browse the <a target="_blank" href="http://docs.colivara.com">documentation</a>. You can also try it for free at <a target="_blank" href="https://colivara.com">colivara.com</a>.</p>
<h2 id="heading-why">Why?</h2>
<p>RAG (Retrieval Augmented Generation) is a powerful technique that allows us to enhance large language models (LLMs) output with private documents and proprietary knowledge that is not available elsewhere. For example, a company's internal documents or a researcher's notes.</p>
<p>However, it is limited by the quality of the text extraction pipeline. With limited ability to extract visual cues and other non-textual information, RAG can be suboptimal for documents that are visually rich. ColiVara uses vision models to generate embeddings for documents, allowing you to retrieve documents based on their visual content.</p>
<p>In addition to RAG, ColiVara works as a visual data extraction pipeline. Most systems today don't have APIs, so for LLMs to interact with them, they have to process everything visually. ColiVara allows you to treat anything as an image and get whatever data you need from there, the same way a human would. What you see is what you get.</p>
<h2 id="heading-tech">Tech</h2>
<p>ColiVara is a web-first implementation of the <a target="_blank" href="https://arxiv.org/abs/2407.01449"><code>ColPali: Efficient Document Retrieval with Vision Language Models</code></a> paper, featuring several optimizations. First, we re-implemented the scoring from the paper using Postgres and Pgvector, wrapped in Django ORM. This allows any developer to use the API and contribute. There's no need for PyTorch, CUDA, or code that only works in notebooks but not on servers.</p>
<p>For the embeddings, we dockerized and optimized the pipeline for serverless workloads.</p>
<p>Finally, we built the entire pipeline as an API with a great developer experience to support production workloads.</p>
<h2 id="heading-evals">Evals</h2>
<p>We will write a dedicated article about our evaluation process, including reproducibility and continuous improvements. The latest benchmark score we hit is an ArxivQ score of 86.6, matching state-of-the-art results on the <a target="_blank" href="https://huggingface.co/spaces/vidore/vidore-leaderboard">Vidore leaderboard</a> and higher than the original ColPali paper, which scored <strong>79.1</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731342624260/edf00a0d-d417-4999-b39a-e6b901ecdd8c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-usage">Usage</h2>
<p>We use vision models a ton today in production and ColiVara is informed by our use-cases. I will highlight a couple here:</p>
<p>We automate the data entry and processing of about 10,000 prescriptions/week today, working with Lilly Direct and Gifthealth pharmacy. These tasks often require reading and reasoning about what is on the screen. Sometimes, we need to save and search through all of this data. We use ColiVara to power the image workflows and search.</p>
<p>For traditional RAG, we would like to highlight <a target="_blank" href="https://onlabel.ai">OnLabel.ai</a>, something we also worked on. We often get all kinds of documents: manufacturers' drug discount coupons, large unorganized tables, charts inside PowerPoint, and many clinical trials where the main takeaways are summarized in charts and tables.</p>
<p>We built ColiVara for difficult tasks, where accuracy and recall must be high over visually rich documents.</p>
<h2 id="heading-quickstart">Quickstart</h2>
<ol>
<li><p>Get a free API Key from the <a target="_blank" href="https://colivara.com">ColiVara Website</a>.</p>
</li>
<li><p>Install the Python SDK and use it to interact with the API.</p>
<p> <code>pip install colivara-py</code></p>
</li>
<li><p>Index a document. Colivara accepts a file url, or base64 encoded file, or a file path. We support over 100 file formats including PDF, DOCX, PPTX, and more. We will also automatically take a screenshot of URLs (webpages) and index them.</p>
<pre><code class="lang-python"> <span class="hljs-keyword">from</span> colivara_py <span class="hljs-keyword">import</span> ColiVara

 client = ColiVara(
     <span class="hljs-comment"># this is the default and can be omitted</span>
     api_key=os.environ.get(<span class="hljs-string">"COLIVARA_API_KEY"</span>),
     <span class="hljs-comment"># this is the default and can be omitted</span>
     base_url=<span class="hljs-string">"https://api.colivara.com"</span>
 )

 <span class="hljs-comment"># Upload a document to the default_collection</span>
 document = client.upsert_document(
     name=<span class="hljs-string">"sample_document"</span>,
     url=<span class="hljs-string">"https://example.com/sample.pdf"</span>,
     <span class="hljs-comment"># optional - add metadata</span>
     metadata={<span class="hljs-string">"author"</span>: <span class="hljs-string">"John Doe"</span>},
     <span class="hljs-comment"># optional - specify a collection</span>
     collection_name=<span class="hljs-string">"user_1_collection"</span>, 
     <span class="hljs-comment"># optional - wait for the document to index</span>
     wait=<span class="hljs-literal">True</span>
 )
</code></pre>
</li>
<li><p>Search for a document. You can filter by collection name, collection metadata, and document metadata. You can also specify the number of results you want.</p>
<pre><code class="lang-python"> <span class="hljs-comment"># Simple search</span>
 results = client.search(<span class="hljs-string">"what is 1+1?"</span>)
 <span class="hljs-comment"># search with a specific collection</span>
 results = client.search(<span class="hljs-string">"what is 1+1?"</span>, collection_name=<span class="hljs-string">"user_1_collection"</span>)
 <span class="hljs-comment"># Search with a filter on document metadata</span>
 results = client.search(
     <span class="hljs-string">"what is 1+1?"</span>,
     query_filter={
         <span class="hljs-string">"on"</span>: <span class="hljs-string">"document"</span>,
         <span class="hljs-string">"key"</span>: <span class="hljs-string">"author"</span>,
         <span class="hljs-string">"value"</span>: <span class="hljs-string">"John Doe"</span>,
         <span class="hljs-string">"lookup"</span>: <span class="hljs-string">"key_lookup"</span>,  <span class="hljs-comment"># or contains</span>
     },
 )
 <span class="hljs-comment"># Search with a filter on collection metadata</span>
 results = client.search(
     <span class="hljs-string">"what is 1+1?"</span>,
     query_filter={
         <span class="hljs-string">"on"</span>: <span class="hljs-string">"collection"</span>,
         <span class="hljs-string">"key"</span>: [<span class="hljs-string">"tag1"</span>, <span class="hljs-string">"tag2"</span>],
         <span class="hljs-string">"lookup"</span>: <span class="hljs-string">"has_any_keys"</span>,
     },
 )
 <span class="hljs-comment"># top 3 pages with the most relevant information</span>
 print(results)
</code></pre>
</li>
</ol>
<h2 id="heading-key-features">Key Features</h2>
<ul>
<li><p><strong>State of the Art retrieval</strong>: The API is based on the ColPali paper and uses the ColQwen2 model for embeddings. It outperforms existing retrieval systems on both quality and latency.</p>
</li>
<li><p><strong>User Management</strong>: Multi-user setup with each user having their own collections and documents.</p>
</li>
<li><p><strong>Wide Format Support</strong>: Supports over 100 file formats including PDF, DOCX, PPTX, and more.</p>
</li>
<li><p><strong>Webpage Support</strong>: Automatically takes a screenshot of webpages and indexes them even if it is not a file.</p>
</li>
<li><p><strong>Collections</strong>: A user can have multiple collections. For example, a user can have a collection for research papers and another for books. Allowing for efficient retrieval and organization of documents.</p>
</li>
<li><p><strong>Documents</strong>: Each collection can have multiple documents with unlimited and user-defined metadata.</p>
</li>
<li><p><strong>Filtering</strong>: Filtering for collections and documents on arbitrary metadata fields. For example, you can filter documents by author or year. Or filter collections by type.</p>
</li>
<li><p><strong>Convention over Configuration</strong>: The API is designed to be easy to use with opinionated and optimized defaults.</p>
</li>
<li><p><strong>Modern PgVector Features</strong>: We use HalfVecs for faster search and reduced storage requirements.</p>
</li>
<li><p><strong>REST API</strong>: Easy to use REST API with Swagger documentation.</p>
</li>
<li><p><strong>Comprehensive</strong>: Full CRUD operations for documents, collections, and users.</p>
</li>
<li><p><strong>Dockerized</strong>: Easy to setup and run with Docker and Docker Compose.</p>
</li>
</ul>
<h2 id="heading-self-hosting">Self-hosting</h2>
<p>ColiVara is available for <a target="_blank" href="https://github.com/tjmlabs/ColiVara?tab=readme-ov-file#getting-started-local-setup">self-hosting</a> and we provide commercial support. It is licensed under the Functional Source License, Version 1.1, with an Apache 2.0 Future License.</p>
<p>For questions, please contact us at <a target="_blank" href="mailto:founders@tjmlabs.com">founders@tjmlabs.com</a>. We are happy to work with you to provide an agreement that meets your needs.</p>
]]></content:encoded></item></channel></rss>