
Introducing VecZip: Embedding Compression Algorithm

VecZip is a novel compression method by DEJAN AI that reduces embedding dimensionality by retaining unique dimensions to improve AI performance and storage.


Machine learning models rely on embeddings to understand complex data like language and images. But these embeddings can be massive, creating huge bottlenecks for storage, processing, and speed. Traditional compression often strips away vital context. That is why DEJAN AI developed VecZip, a new approach designed to shrink embeddings without losing their meaning.

While standard techniques like Principal Component Analysis, or PCA, focus on dimensions with the highest variance, VecZip takes the opposite approach. It analyzes the data to find and keep the dimensions with the least commonality, preserving the most unique features. In practice, it can compress embeddings down to just sixteen dimensions.

This aggressive reduction shrinks file sizes by about fifty to one, drastically cutting storage and compute costs. But the real surprise is the performance. Tests show that VecZip actually improves accuracy on downstream tasks, like measuring sentence similarity. It also enhances real-world applications, from classifying search intent and clustering data to optimizing link recommendations.

By optimizing the essential features of embeddings, VecZip makes AI systems faster, cheaper, and more scalable.

Embeddings are vital for representing complex data in machine learning, enabling models to perform tasks such as natural language understanding and image recognition. However, these embeddings can be massive, creating challenges for storage, processing, and transmission. At DEJAN AI, we’ve developed VecZip, a novel approach that reduces file size without compromising data quality, with the goal of improving the quality of AI processes.

The Challenge of Large Embeddings

While traditional compression techniques can help reduce file size, they are not always optimized for the unique structure of embeddings, and they may fail to preserve essential semantic or contextual information. This is where VecZip excels.

VecZip Approach

VecZip is a compression method designed to reduce the dimensionality of embeddings while retaining the most salient information. It identifies and removes the less informative dimensions and keeps the most unique ones, that is, those with the least commonality across samples.

This not only reduces embedding size but also improves performance on downstream tasks. The process has three steps:

  1. Dimensionality Analysis: VecZip analyzes the distribution of values across all samples. Dimensions with high commonality are considered less important.
  2. Feature Selection: VecZip retains the dimensions with the least commonality, effectively keeping the most unique aspects of the embeddings. In our current implementation, we target a reduction to just 16 dimensions.
  3. Compressed Representation: The result is a compact representation of the original data, with minimal loss of critical information and an overall reduced file size.
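
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the dejan package: the article does not define how "commonality" is measured, so `modal_share` (the fraction of samples that fall into a dimension's fullest histogram bin) is a hypothetical stand-in, and the names `modal_share` and `veczip_sketch` are invented for this example.

```python
def modal_share(column, bins=10):
    """Hypothetical commonality proxy: share of samples in the fullest histogram bin."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant dimension: every sample is identical
        return 1.0
    counts = [0] * bins
    for value in column:
        # Map the value to a bin index; clamp so the maximum lands in the last bin.
        index = min(int((value - lo) / (hi - lo) * bins), bins - 1)
        counts[index] += 1
    return max(counts) / len(column)


def veczip_sketch(embeddings, target_dims=16):
    """Keep the target_dims dimensions with the lowest commonality score (assumed metric)."""
    n_dims = len(embeddings[0])
    columns = [[row[j] for row in embeddings] for j in range(n_dims)]
    # Step 1: score every dimension across all samples.
    scores = [modal_share(column) for column in columns]
    # Step 2: rank dimensions from least to most common and keep the first target_dims.
    ranked = sorted(range(n_dims), key=lambda j: scores[j])
    keep = sorted(ranked[:target_dims])
    # Step 3: build the compact representation from the kept dimensions only.
    compressed = [[row[j] for j in keep] for row in embeddings]
    return compressed, keep


# Toy data: dimensions 0 and 2 are spread out, 1 is almost constant, 3 is constant.
rows = [[i / 99, 1.0 if i == 0 else 0.0, (i * 7 % 100) / 99, 0.5] for i in range(100)]
compressed, keep = veczip_sketch(rows, target_dims=2)
print(keep)
```

On the toy data, the two spread-out dimensions are kept and the near-constant ones are dropped. Any other per-dimension score could be swapped in for `modal_share` without changing the rest of the sketch.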

VecZip vs. PCA

PCA (Principal Component Analysis) is a commonly used dimensionality reduction technique. It projects the data onto the directions of greatest variance across the entire dataset. VecZip instead keeps a subset of the original dimensions, emphasizing those with the least commonality.
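
The structural difference can be written out as follows. The notation is introduced here for illustration and is not from the original post: X is the n × d embedding matrix, k the target size, W_k the matrix of the top-k principal directions, and c_j a per-dimension commonality score whose exact definition the article leaves open.

```latex
% PCA: center the data, then project onto the top-k principal directions.
% Each output column is a linear combination of all d input dimensions.
Z_{\mathrm{PCA}} = \left(X - \mathbf{1}\bar{x}^{\top}\right) W_k,
\qquad W_k \in \mathbb{R}^{d \times k}

% Selection (VecZip-style): keep a subset S of the original columns.
% Each output column is an unchanged copy of one input dimension.
Z_{\mathrm{sel}} = X_{:,S},
\qquad S = \operatorname*{arg\,min}_{|S| = k} \sum_{j \in S} c_j
```

One practical consequence is that a selection-based method needs no projection matrix at inference time, only the list of kept column indices.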

  • PCA (Left): Performs better at light to moderate dimensionality reduction.
  • VecZip (Right): Performs better at aggressive reduction.

File sizes before and after compression (directory listing):

Mode     LastWriteTime          Length      Name
----     -------------          ------      ----
-a----   9/12/2024 12:52 AM     246830957   embeddings.csv (235MB)
-a----   12/12/2024 9:15 PM     4584099     zipped-embeddings.csv (4.37MB)
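
As a quick check on the listing, the compression ratio can be computed directly from the two byte counts. The only assumption is that "MB" in the listing means 1024 × 1024 bytes, which matches the sizes shown.

```python
original_bytes = 246_830_957   # embeddings.csv
compressed_bytes = 4_584_099   # zipped-embeddings.csv

MIB = 1024 ** 2                # assumed meaning of "MB" in the listing
ratio = original_bytes / compressed_bytes

print(f"{original_bytes / MIB:.0f} MB -> {compressed_bytes / MIB:.2f} MB, about {ratio:.1f}:1")
```

The ratio comes out slightly under 54:1, consistent with the approximately 50:1 figure quoted in this article.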

Test Results and Key Findings

To evaluate the effectiveness of VecZip, we conducted tests using the sentence-transformers/stsb dataset. We compared original and compressed embeddings across a variety of tasks. The most prominent results:

  • Enhanced Similarity Scores: On a sentence similarity task, VecZip-compressed embeddings showed a lower mean absolute difference from the “true” scores than the original, higher-dimensional embeddings.
  • Significant Compression: The data was also compressed by approximately 50:1, which greatly reduces the required storage space and can improve the speed of processing embeddings.

Figure: Vector embeddings. The top two rows are the VecZip-pruned embeddings for two sentences; the original embeddings are shown below them. This gives an intuitive sense of the impact the method has on file size.
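
The similarity comparison described above can be sketched as follows. The post does not state which similarity function or score scale was used, so two assumptions are made here: predicted similarity is the cosine similarity of the two sentence embeddings, and the gold scores are scaled to the range 0 to 1. The function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_absolute_difference(pairs, gold_scores):
    """Average |predicted - gold| over sentence pairs.

    pairs: list of (embedding_a, embedding_b) tuples
    gold_scores: human similarity labels, assumed to lie in [0, 1]
    """
    predicted = [cosine_similarity(a, b) for a, b in pairs]
    return sum(abs(p - g) for p, g in zip(predicted, gold_scores)) / len(gold_scores)


# Toy example: one identical pair and one orthogonal pair.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
error = mean_absolute_difference(toy_pairs, [1.0, 0.0])
print(error)
```

Running the same function once on the original embeddings and once on the compressed ones, against the same gold scores, gives the two numbers being compared; the lower value indicates closer agreement with the human labels.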

Broader Applications

At DEJAN AI, we apply dimensionality reduction techniques to improve many aspects of our clients’ work.

  • Link Recommendations: Reduced embeddings aid in improving the quality of internal link recommendations.
  • Anchor Text Selection: We see enhanced performance on anchor text selection tasks when using VecZip.
  • Query Intent Classification: These techniques also improve our ability to classify user query intent.
  • Clustering: The improved clustering behavior of the compressed embeddings gives us a better overview of the data as a whole.
  • CTR Optimization: We apply compressed embeddings to help optimize click-through rates.
  • General NLP Tasks: VecZip can improve performance on many other NLP tasks.
  • Reduced Costs: By greatly reducing the number of dimensions, we also see lower storage needs and reduced compute overhead.

VecZip is an important step in developing efficient AI tools. By optimizing the feature space of embeddings, while improving downstream task performance, it paves the way for more scalable and performant AI systems.

We encourage the research and development community to explore the potential of VecZip, and we hope this approach enables further innovation in the field of machine learning.

pip install dejan
dejan veczip embeddings.csv zipped-embeddings.csv

Dan Petrovic · Dec 12, 22:12

What is the GitHub repository for ‘dejan’ because I can’t find it on PyPi.

LeMoussel · Dec 12, 15:08

I messed up the repo and took it down until I fix it up. Wheel based install should be enough to take it for a spin. If you need any details feel free to ping me.

Dan Petrovic · Dec 13, 01:57

Possible to do pip install from Git repository?
E.g : pip install git+https://github.com/….

LeMoussel · Dec 13, 10:12
Dan Petrovic · Dec 13, 10:42