From Upload to Insight: How I used LangChain and Pinecone to Analyze PDFs

Learn how to build an AI-centered application that lets users upload PDFs and interact with their content through a conversational interface. Leveraging tools such as LangChain, Pinecone, and OpenAI's API, we can efficiently process large documents, answer questions about them, and tackle the challenges of building a scalable system.

I've always been curious about how different AI technologies interact with various files, most specifically with PDFs. There are now hundreds of projects live that let you upload your PDF files into a chat interface and have a conversation with them. In most cases, the problem being solved is quick access to a high volume of information, letting the chat do most of the work – so the dependency on Ctrl+F doesn't have to be so frequent any longer. It took me a while to understand at first, but it's important to outline what that process looks like behind the scenes. I'll share how I set up that system using LangChain and Pinecone, along with some code snippets and lessons I learned along the way.

Understanding the idea behind the project

The main goal is pretty straightforward: create a web application where a user can upload a PDF document and ask questions about its content, getting back accurate, helpful responses in an ongoing conversation.

The flow of the experience: 

  • Sign up or log in.
  • Process the uploaded PDF and extract the text.
  • Break the text into manageable chunks.
  • Generate embeddings for these chunks through OpenAI.
  • Store embeddings into a vector database for efficient information retrieval.
  • Build a conversational interface for the user to ask questions and receive answers.

Getting started

There's a range of tools and technologies we could use to execute this idea. I stumbled upon LangChain back when it announced its integrations with HuggingFace. It's a framework designed to streamline the development of applications that use LLMs. In other words, it's a set of ready-to-go tools and components that make it easy to build applications capable of understanding and generating responses – a way to build efficiently and save time, especially for those without heavy technical backgrounds.

It does take some time to read through the documentation and learn about the features, though I promise the usage is straightforward. I used this article to gain a better understanding of what I needed to get started.

The next item I needed was a vector database to handle similarity search efficiently.

When we convert text into embeddings using language models, we get high-dimensional vectors that capture the meaning of the text. Imagine each piece of text as a point in a multi-dimensional space, where similar texts sit near each other. To efficiently find and retrieve the pieces most similar to a search query, we need a system built for exactly that.
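
To make "near each other" concrete, here's a toy TypeScript illustration of how closeness between two embedding vectors is usually measured (cosine similarity). This is conceptual only – Pinecone does this at scale for you:

// Cosine similarity: close to 1 means very similar meaning, close to 0 means unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB);
}

// Embeddings of two related sentences score close to 1; unrelated ones score much lower
console.log(cosineSimilarity([0.1, 0.9, 0.2], [0.12, 0.85, 0.25])); // close to 1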

Most databases aren't equipped to perform rapid similarity searches on high-dimensional vectors. They excel at exact matches, which is fine, but they struggle to find relatively "close" matches in a sea of numerical data. This is where vector databases become essential – they're specifically optimized for storing embeddings and performing similarity searches quickly. That's how I ended up choosing Pinecone: a database built for storing embeddings, with fast queries and easy scaling without a drop in performance.

To put it simply, think of Pinecone as a highly organized library shelf where books are sorted not just by title or author, but by the content and themes within them. When you have a question or want information about something specific, Pinecone helps you quickly find the book that's most relevant to your need. You can get started with the article provided – in my case, I'm using the JavaScript installation.

Building the MVP

Here's what we need to install as part of our NPM packages:

npm install langchain @langchain/core @langchain/openai @langchain/community \
  @langchain/pinecone @pinecone-database/pinecone pdf-parse openai

Breaking these down: the LangChain packages simplify working with large language models, handle document loading and splitting, and manage the interaction between the model provider (OpenAI in my case, though alternatives like Claude work too) and Pinecone, while the Pinecone client package connects to the vector database where the embeddings will be stored. I also used Google Firebase for robust backend data handling and storage, Clerk as my authentication preference, and Next.js as the web framework.

Uploading and Processing PDFs

To extract text from the uploaded PDFs, I used the PDFLoader from LangChain, which simplifies the process.

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
// ...
// `data` is the uploaded PDF – PDFLoader accepts a file path or a Blob
const loader = new PDFLoader(data);
// Each page becomes a Document with its text in pageContent
const docs = await loader.load();

Keep in mind, we'll want to use a Next.js API route and Firebase Storage to give users a secure way to upload their files – and this is where PDFLoader makes things straightforward. It reads the PDF and extracts the text, which we can then break down.
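
As an illustration, here's a minimal sketch of what the receiving side of that route could look like – the route path and field names are just examples, and the Firebase Storage upload step is omitted:

// app/api/upload/route.ts – illustrative only
import { NextResponse } from "next/server";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

export async function POST(req: Request) {
  const formData = await req.formData();
  const file = formData.get("file") as File; // the uploaded PDF from the client

  // PDFLoader accepts a Blob (File extends Blob), so no temp file is needed
  const loader = new PDFLoader(file);
  const docs = await loader.load();

  // ...store the raw file in Firebase Storage and kick off chunking/embedding here

  return NextResponse.json({ pages: docs.length });
}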

To manage the text effectively, we split it into small chunks using LangChain's RecursiveCharacterTextSplitter. This helps with efficiency – chunks fit within the model's token limits and are faster to process – and with relevance, since we can retrieve only the sections most related to a given message.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// ...
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const splitDocs = await splitter.splitDocuments(docs);

Once we've handled document upload and text extraction, the next big item is using OpenAI to generate embeddings. You can picture these embeddings as numerical representations of the text in the document that capture its semantic meaning. The process involves sending text chunks to OpenAI's embeddings endpoint and receiving the corresponding vectors.

import { OpenAIEmbeddings } from "@langchain/openai";
// ...
const embeddings = new OpenAIEmbeddings({
  openAIApiKey: process.env.OPENAI_API_KEY,
});
// embedDocuments expects plain strings, so pass each chunk's pageContent
const embeddingVectors = await embeddings.embedDocuments(splitDocs.map((doc) => doc.pageContent));

A closer look at what we're doing: we iterate over the text chunks extracted from the PDF, make the API call for each batch, and collect the returned vectors – each one representing the meaning of its chunk – so they can all be stored. Keep in mind there will be errors along the way; some notable ones are API rate limits, chunk size limits, general API failures, cost management, and data privacy concerns.
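
For the rate-limit piece in particular, a simple retry with exponential backoff goes a long way. Here's a rough sketch using the embeddings instance from the snippet above – the retry count and delays are arbitrary, not tuned values:

// Retry a batch of embeddings with exponential backoff – illustrative only
async function embedWithRetry(texts: string[], maxRetries = 3): Promise<number[][]> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await embeddings.embedDocuments(texts);
    } catch (err) {
      if (attempt === maxRetries) throw err;
      // Wait 1s, 2s, 4s... before retrying (covers transient rate-limit errors)
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}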

Once we can generate embeddings, we then want to perform efficient searches within the Pinecone index. By organizing these embeddings, the application can quickly find and retrieve the most relevant text chunks. In the example below, I used the document ID (docId) as a namespace in Pinecone so that documents and users aren't mixed up – a way to isolate and secure data.

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";

// ...

// Note: newer versions of the Pinecone client only need the API key;
// older versions used PineconeClient with an init() call and an environment value.
const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY!,
});

const indexName = "pdfchat";
const index = pinecone.index(indexName);

// Embed the chunks and upsert them into the index under the document's namespace
await PineconeStore.fromDocuments(splitDocs, embeddings, {
  pineconeIndex: index,
  namespace: docId,
});
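
To sanity-check that retrieval works, you can query the same namespace back out. A quick sketch, assuming the index and embeddings objects from above:

// Reconnect to the existing index and run a similarity search in the document's namespace
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex: index,
  namespace: docId,
});

// Returns the 4 chunks whose embeddings are closest to the question's embedding
const results = await vectorStore.similaritySearch("What is this document about?", 4);
console.log(results.map((doc) => doc.pageContent));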

Building an interface

Now that we've built the core functions for how text chunks are uploaded, processed, embedded, and stored, we can surface the whole approach in a chat interface. Typically, you'd create a Chat.tsx or Interface.tsx file, whichever you prefer. Since it's heavily React based, it should be straightforward to build out your PDF view, results, and chat window. Keep in mind, you'll also want to import Firebase (or whichever database you're using), which may mean marking your .tsx file with the "use client" directive.
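
As a rough skeleton – component and import names here are illustrative, not my exact files – the client component could look something like this, with the handleSubmit handler shown a bit further down living inside it:

"use client";

import { useState, FormEvent } from "react";
import { askQuestion } from "@/actions/askQuestion"; // hypothetical server action path

type Message = { role: "human" | "ai"; message: string };

export default function Chat({ id }: { id: string }) {
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState("");

  const handleSubmit = async (e: FormEvent) => {
    // ...see the full handler below
  };

  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m, i) => (
        <p key={i}>
          {m.role}: {m.message}
        </p>
      ))}
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button type="submit">Ask</button>
    </form>
  );
}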

When handling user messages, areas to keep in mind: 

  • Capturing the user's questions: Double-check that you can take the input from the chat interface and save it.
  • Updating the chat history: Decide how much history to keep – depending on limits or how users navigate the experience, you may want to cap it.
  • Displaying the output result: Decide where answers appear; you can also explore whether the chat should be able to return code, links, etc.

Sample code for how to handle user messages:

// askQuestion is the server action that runs the retrieval chain (shown later)
const handleSubmit = async (e: FormEvent) => {
  e.preventDefault();

  // Adds the user's question to the messages
  setMessages((prevMessages) => [...prevMessages, { role: "human", message: input }]);
  setInput("");

  // Sends the question to the server and gets the response
  const response = await askQuestion(id, input);

  // Adds the AI's response to the messages
  setMessages((prevMessages) => [...prevMessages, { role: "ai", message: response }]);
};

As we start to put things together – on the server side, I used LangChain's retrieval and conversational chains to find the most relevant text chunks from the PDF and then generate an answer to the question provided. This involves a function that fetches the last few messages for the given docId and converts the data into a format the language model can understand.

Begin with fetching the chat history

async function fetchMessagesFromDB(docId: string) {
  const { userId } = await auth();
  if (!userId) {
    throw new Error("User not found.");
  }
  // ...query the user's recent messages for this docId from Firestore and return them
}

Then we create a function that generates a response to the user's question.
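
Before defining it, here are the imports and the chat model these chains rely on – the module paths assume a recent LangChain JS release and may shift between versions, and any LangChain chat model could stand in for ChatOpenAI:

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createHistoryAwareRetriever } from "langchain/chains/history_aware_retriever";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

// The `model` referenced below – any chat model supported by LangChain works here
const model = new ChatOpenAI({ openAIApiKey: process.env.OPENAI_API_KEY });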

const generateLangchainCompletion = async (docId: string, question: string) => {
  const pineconeVectorStore = await generateEmbeddingsInPineconeVectorStore(docId);

  if (!pineconeVectorStore) {
    throw new Error("Pinecone vector store not found.");
  }

The job here is to double-check that the embeddings for the document are available in Pinecone; if not, we generate them using generateEmbeddingsInPineconeVectorStore (more details are in LangChain's VectorStores documentation). We then create a retriever that finds the chunks most relevant to the user's query. Once we can retrieve that information, we're set to fetch the chat history and further define prompt templates.
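
The retriever used in the chains below can come straight from the vector store – a quick sketch, where the number of chunks to fetch is just an example:

  // Wrap the vector store as a retriever that returns the top 4 most similar chunks
  const retriever = pineconeVectorStore.asRetriever(4);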

Here's an example of how we can build the history-aware (rephrasing) prompt and wire up the chains:

  const chatHistory = await fetchMessagesFromDB(docId);

  const historyAwarePrompt = ChatPromptTemplate.fromMessages([
    ...chatHistory,
    ["user", "{input}"],
    [
      "user",
      "Given conversations, generate a search query to look up in order to get info relevant to the conversation.",
    ],
  ]);

  const historyAwareRetrieverChain = await createHistoryAwareRetriever({
    llm: model,
    retriever,
    rephrasePrompt: historyAwarePrompt,
  });

  const historyAwareRetrievalPrompt = ChatPromptTemplate.fromMessages([
    [
      "system",
      "Answer the user's questions based on the below content:\n\n{context}",
    ],
    ...chatHistory,
    ["user", "{input}"],
  ]);

  const historyAwareCombineDocsChain = await createStuffDocumentsChain({
    llm: model,
    prompt: historyAwareRetrievalPrompt,
  });

  const conversationalRetrievalChain = await createRetrievalChain({
    retriever: historyAwareRetrieverChain,
    combineDocsChain: historyAwareCombineDocsChain,
  });

  const reply = await conversationalRetrievalChain.invoke({
    chat_history: chatHistory,
    input: question,
  });

  return reply.answer;
};

Checking security and managing user data

Since we're building a web application that handles user-uploaded content and interacts with external APIs, we also need to think about protecting user data and the information passed along to those APIs. For authentication, I use Clerk, which is easy to integrate and handles password encryption, session management, and other general security tasks.

I've implemented it using pre-built components such as <ClerkLoaded /> in the layout.tsx file, ensuring the user is authenticated before anything else loads in the experience. Example below:

import { auth } from "@clerk/nextjs/server";

const { userId } = await auth();
if (!userId) {
  throw new Error("User not authenticated.");
}

When protecting your API keys, these are typically stored in your .env file, with naming conventions like OPENAI_API_KEY or OPENAI_SECRET_KEY. Setting it up this way keeps the keys out of the browser so they're only used server-side, preventing unauthorized access and helping you comply with data privacy requirements.
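
For reference, a minimal .env.local might look like this – the variable names match the snippets above, and the values are placeholders:

# .env.local – never commit this file; values are placeholders
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...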

Challenges and what I've learned

Building this project was an experience that came with its fair share of challenges. Each obstacle, of course, provided an opportunity to learn and refine, especially when it came to handling large PDFs.

For example, processing large files efficiently is crucial, and we need to make sure we don't run into memory issues. That's why RecursiveCharacterTextSplitter helps: it breaks the document text into manageable chunks, respecting natural boundaries where it can while keeping each chunk small enough to stay within the language model's token limits.

Another overarching challenge was thinking about how the langchain.ts file alone would orchestrate everything – bringing elements together, generating embeddings, and interacting with different technologies all at once. One way to handle this is to separate the different types of functions: helper functions, the main functions, and the error handling that supports them. Think of it as building modularity into what we write – breaking it into small, reusable sections enhances maintainability as well as readability. This approach also gives us flexibility for when we want to add other features later without significant rewrites of what already exists.
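
To make that concrete, here's roughly how the exports of such a file could be shaped – the helper name is purely illustrative, not my exact implementation:

// langchain.ts – an illustrative shape, not the exact implementation

// Helper: connect to Pinecone and return the index handle (hypothetical helper)
async function getPineconeIndex() { /* ... */ }

// Load, split, and embed a PDF, then upsert it under its docId namespace
export async function generateEmbeddingsInPineconeVectorStore(docId: string) { /* ... */ }

// Main entry point: fetch history, retrieve relevant chunks, and generate an answer
export async function generateLangchainCompletion(docId: string, question: string) { /* ... */ }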

Building this project was a journey of learning and problem-solving for me personally. From handling large documents to ensuring security and performance, in the end it was fun – and hopefully this article is useful as you start to build your own.