Boost Your Productivity: Generate 33k Page Embeddings in Just 4 Mins!
Table of Contents:
- Introduction
- Background
- Problem Statement
- The Solution: Using Ray and Ray Data
- The Flow of the Embedding Generation Process
- Code Walkthrough
- Results and Performance
- Conclusion
- Further Resources
- Frequently Asked Questions
Introduction
Welcome to this article, where we will explore how to quickly and effectively embed large numbers of documents using LangChain and Ray. As a software engineer at Anyscale, I have encountered challenges building LLM applications and embedding numerous PDF documents efficiently. In this article, we will dive into parallelizing embedding generation across multiple GPUs, allowing us to embed thousands of PDF documents in a matter of minutes.
Background
This article is the second part of our series on LangChain and Ray, where we explore how these tools can be used together to build LLM-based applications. In the first part, we showed how you can use LangChain and Ray together to build a search engine.
To build such an application, there are two main steps. The first step involves generating embeddings for a document corpus, while the second step focuses on deploying the application to handle user queries.
Problem Statement
In this article, we will specifically address the challenges encountered during the first step of LLM application development: generating embeddings from a large set of documents. When dealing with a sizable document corpus, this process can be time-consuming and hinder rapid experimentation with different embedding models, chunk sizes, and document variations. Therefore, the goal is to find a solution that generates embeddings quickly, allowing developers to iterate on different approaches in just a few minutes.
The Solution: Using Ray and Ray Data
To overcome the challenge of embedding a large number of documents quickly, we use Ray and the Ray Data library. By leveraging the parallel processing power of multiple GPUs, we can significantly reduce the time required for embedding generation. Because Ray is a distributed computing framework, it lets us spread the workload across many GPUs and complete the process in a matter of minutes.
The Flow of the Embedding Generation Process
Let's take a closer look at the flow of the embedding generation process. The process begins with a set of 2,000 PDF documents stored in an S3 bucket. These documents are loaded as bytes and converted into text strings. Next, the text is split into chunks, which will be transformed into embeddings.
To encode these chunks into vectors, we use a pre-trained embedding model such as a Sentence Transformer model. This ensures that similar pieces of text have similar vector representations. Once we have the embeddings for all 30,000 pages, we can store them in a vector database for later retrieval.
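The chunking step described above can be sketched in plain Python. The chunk size and overlap below are illustrative assumptions, not values from the article, and the commented model name is likewise only an example:

```python
def split_into_chunks(text, chunk_size=512, overlap=64):
    """Split text into fixed-size character chunks with a small overlap,
    so a sentence straddling a boundary appears in both neighboring chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# With sentence-transformers installed, each chunk could then be encoded
# into a vector (the model name here is an assumption for illustration):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embeddings = model.encode(split_into_chunks(document_text))
```

Character-based splitting is only one option; token-based or sentence-aware splitters trade simplicity for better semantic boundaries.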
This entire process occurs during the offline phase of development. In the online phase, when a user query is received, similar documents are pulled from the vector database, and both the user query and the similar documents are passed as the prompt to the large language model.
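The online-phase lookup can be sketched with a simple cosine-similarity scan. This is a stand-in for a real vector database, and the function names are illustrative, not from the article:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_similar(query_embedding, stored, k=3):
    """Return the k stored texts whose embeddings are most similar to the query.
    A real deployment would use a vector database index, not a linear scan."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# The retrieved texts and the user query would then be combined into a
# single prompt for the large language model.
```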
Code Walkthrough
Now, let's delve into the code implementation using the Anyscale platform, a managed version of Ray. However, note that everything showcased here can also be accomplished with the open-source Ray cluster launcher.
The code for generating the embeddings consists of a little over 100 lines. Despite its simplicity, Ray Data lets us effectively parallelize the embedding generation process across all available GPUs in our cluster.
The code starts by reading the PDF documents from the S3 bucket as a Ray Dataset using the Ray Data APIs. We apply transformations on this dataset to convert the bytes into text and split the text into chunks. Finally, we define a transformation class that encodes the chunks into vectors using the Sentence Transformer model and returns each chunk's original text alongside its embedding.
By mapping these transformations across our dataset, we trigger the execution of the entire pipeline. Once the embeddings are generated, they can be stored in a vector database, just like in a traditional LLM-based application using LangChain.
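Putting the walkthrough together, the pipeline might look roughly like the sketch below. This is not the article's actual code: the bucket URI, chunk size, model name, and worker count are illustrative assumptions, and the Ray Data calls reflect recent versions of that API (running it would require ray, pypdf, and sentence-transformers):

```python
def chunk_text(text, chunk_size=1000):
    """Pure helper: split a document's text into fixed-size chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_embedding_pipeline(bucket_uri="s3://example-pdf-bucket/", num_gpu_workers=20):
    """Sketch of the Ray Data pipeline: PDF bytes -> text -> chunks -> embeddings."""
    import io
    import ray
    from pypdf import PdfReader
    from sentence_transformers import SentenceTransformer

    def pdf_to_text(row):
        # Ray's binary reader yields one row per file with a "bytes" column.
        reader = PdfReader(io.BytesIO(row["bytes"]))
        return {"text": "".join(page.extract_text() or "" for page in reader.pages)}

    def split(row):
        return [{"chunk": c} for c in chunk_text(row["text"])]

    class Embedder:
        # Instantiated once per GPU worker, so the model loads only once.
        def __init__(self):
            self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

        def __call__(self, batch):
            batch["embedding"] = self.model.encode(list(batch["chunk"]))
            return batch  # each chunk's original text plus its embedding

    return (
        ray.data.read_binary_files(bucket_uri)                  # load PDFs as bytes
        .map(pdf_to_text)                                       # bytes -> text
        .flat_map(split)                                        # text -> chunks
        .map_batches(Embedder, concurrency=num_gpu_workers, num_gpus=1)
    )
```

Because Ray Data executes lazily, the work is only triggered when the resulting dataset is consumed, for example by iterating over it to write rows into a vector database.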
Results and Performance
Using the combined power of Ray and Ray Data, we observed remarkable performance. In just under four minutes, we embedded 2,000 PDF documents and stored the results in a vector database. This parallelization saved a significant amount of time compared to a single-GPU approach, which would have taken approximately an hour.
Furthermore, monitoring GPU utilization through the Ray dashboard showed that all 20 GPUs in our cluster were nearly 100% utilized during the computation phase, highlighting how effectively Ray harnesses the full potential of parallel processing across multiple GPUs.
Conclusion
In conclusion, the integration of LangChain and Ray makes the embedding generation process efficient and scalable. Developers can quickly generate embeddings from PDF documents, enabling fast-paced development and experimentation with different approaches. Together, these tools let LLM-based applications be developed and deployed more rapidly without sacrificing performance.
Further Resources
- Ray Data library documentation
- Ray official documentation
- LangChain project resources
Frequently Asked Questions
Q: Can the same approach be used for embedding other types of documents, such as Word or Excel files?
A: Yes, the same approach can be applied to other document types. Ray Data is flexible about file formats, so the pipeline adapts to a variety of document embedding scenarios by swapping in the appropriate loader.
Q: What are the potential downsides or limitations of using Ray for large-scale embedding generation?
A: One limitation to consider is the hardware requirement. Ray is designed to take advantage of multiple GPUs, so access to a GPU cluster is crucial for optimal performance. Additionally, the initial setup and configuration may require some technical expertise.
Q: Are there any recommended alternative solutions for embedding large amounts of documents?
A: While Ray provides an efficient way to parallelize embedding generation, general-purpose frameworks such as TensorFlow or PyTorch can also distribute the work, though they typically require more manual implementation and do not offer the same level of built-in data-pipeline parallelism.