Summary
I coded a simple RAG app in a Jupyter Notebook to answer queries based on one of my reports. I started by exploring the Wikipedia retriever, then built my own retriever using a PDF loader, a text splitter, OpenAIEmbeddings(), and Chroma as the vector database. Along the way I gained a better understanding of the costs involved in running a RAG app.
Introduction
This NVIDIA article gave a pretty good overview of what RAG is. To summarize, an LLM may require good external context to provide an expert answer to a question. LLM applications built on the RAG technique typically involve:
- loading large amounts of external (sometimes proprietary) information,
- converting it into a numerical representation that is easier for the LLM to process (a step called embedding),
- storing the embeddings in a vector database, and
- using the database as part of the prompt (as a retriever).
What I believe is happening behind the scenes: the retriever finds the chunks in the vector database that are most relevant to the question and passes them, together with the question, as context to the LLM, which then answers the question based on that context.
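In pseudocode, the per-query flow might look something like this (a conceptual sketch of the idea, not any particular library's API; all names here are placeholders):

```python
# Conceptual sketch of one RAG query; vector_db and llm are
# hypothetical placeholders, not real library objects.
def answer(question, vector_db, llm, k=3):
    chunks = vector_db.search(question, k=k)       # most relevant chunks
    context = "\n\n".join(chunks)                  # combine into one context
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                             # answer grounded in context
```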
If You Thought of Building an App with Wikipedia
You don't need to perform the loading, embedding, and storing steps yourself, as there is already a Wikipedia retriever available (Link). The Wikipedia retriever is free, but using an LLM to answer specific questions based on Wikipedia will still cost you, because the retrieved context must be fed into the model.
Based on a sample wiki app that I built here, I would say the answers provided by the LLM based on Wikipedia are quite accurate. However, it took about 8k tokens to answer the given questions.
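For reference, using the retriever takes only a few lines. A minimal sketch, assuming the langchain-community and wikipedia packages are installed (the query string is just an example, not one from my app):

```python
from langchain_community.retrievers import WikipediaRetriever

retriever = WikipediaRetriever()

# Each returned Document carries the page text plus metadata
# such as the article title and source URL.
docs = retriever.invoke("Convolutional neural network")
for doc in docs:
    print(doc.metadata["title"])
```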
A Simple RAG Application
A summary of the steps I took; you can see the full notebook here. A rough sketch of the code follows the list.
- Load one of my reports using PyPDFLoader.
- Split the document into chunks using the standard RecursiveCharacterTextSplitter.
- Embed the chunks and store them in a vector database.
- Explore using the database via similarity_search and as a retriever.
- Chain the retriever with a prompt, the model, and an output parser.
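A minimal sketch of the loading, splitting, and embedding steps; the file name, chunk sizes, and package layout are my assumptions rather than the exact notebook code:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load the PDF report into LangChain Documents (one per page).
docs = PyPDFLoader("report.pdf").load()

# Split into overlapping chunks so each embedding stays focused.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed each chunk and store the vectors in Chroma.
vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())
```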
Some further details:
Embedding is not free if you use OpenAI embedding models. The cheapest model available (text-embedding-ada-002) costs about one dollar per million tokens embedded; my report took about 6k tokens to embed.
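To estimate the embedding bill before committing, you can count tokens locally with tiktoken. A sketch, reusing the chunks list from above and plugging in the per-million-token rate quoted here:

```python
import tiktoken

# text-embedding-ada-002 uses the cl100k_base encoding.
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

# Count tokens across all chunks and apply the quoted $1/1M rate.
n_tokens = sum(len(enc.encode(c.page_content)) for c in chunks)
print(f"{n_tokens} tokens -> ~${n_tokens / 1_000_000 * 1.00:.4f} to embed")
```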
Chroma is one of the useful classes from LangChain for creating and updating the vector store. The persist_directory option lets you save the vector store to a location on disk so that you can load it again later.
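Continuing the sketch above, persisting and reloading the store looks like this (the directory path is an arbitrary example):

```python
# Build the store once and save it to disk.
vectordb = Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)

# In a later session, reload it without paying for embeddings again.
vectordb = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(),
)
```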
There are two approaches for the search and retrieve step:
- similarity_search — returns the chunks of text most relevant to your query.
- as_retriever — performs the same search, except the retriever object can be used as a component in a more complex LCEL (LangChain Expression Language) chain. Both are shown in the sketch below.
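Continuing the sketch, both approaches take only a couple of lines (the question and the value of k are illustrative):

```python
question = "how is the convolution neural network used by the author?"

# Direct search: returns the k most relevant chunks as Documents.
top_chunks = vectordb.similarity_search(question, k=3)

# Retriever: the same search wrapped as a Runnable for use in LCEL chains.
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
retrieved = retriever.invoke(question)
```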
Chaining the vector database as a retriever within a RAG application using LCEL allows us to get a more precise answer based on the context the retriever supplies. The next image shows the sample response I got by asking the same question as above: "how is the convolution neural network used by the author?"
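A minimal sketch of such a chain, reusing the retriever from above; the prompt wording is my own placeholder, not the notebook's exact text:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n"
    "{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into one context string.
    return "\n\n".join(d.page_content for d in docs)

# Retrieve context, fill the prompt, call the model, parse to a string.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

answer = chain.invoke("how is the convolution neural network used by the author?")
```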
The Cost Implications
A basic RAG application has two LLM-related costs: the initialization cost (converting documents/information into embeddings) and the query cost (sending questions, with context, to the LLM and retrieving answers).
The initialization cost is likely to be relatively small and one-off. At one dollar per million tokens, embedding even a whole book is cheap, and the embeddings can be stored and shared across applications, which further reduces the cost per application.
The query cost is of more interest here. For my app, I sent 3 Wikipedia-related queries to OpenAI's cheapest model (GPT-4o-mini) for a total cost of about 0.14 cents, i.e. roughly 0.05 cents per query. A better model could cost twenty times as much per token, putting each query at around a cent, while the most expensive models could push each query toward $0.10.
Whether the cost is worth it depends on the application. My opinion: if we are building applications that replace or reduce the human resources required (such as customer service representatives), then the initialization and query costs will be small compared to the wages they save. Otherwise, only queries that bring enough value will justify their cost.