Unlocking Retrieval Augmented Generation (RAG) for Domain-Specific LLM chatbots

Introduction

All right, hi everybody, my name is Delini Sumanapala. I've been coming to a lot of these Mindstone events, and it's really cool speaking to everyone about how they're utilizing AI in their day-to-day work and also for enterprise-level solutions. There's such a broad spectrum of people who are using LLMs and AI tools for the work that they're doing, and it's very, very cool to keep attending these things. And it's great to be presenting today as well.

Topic Overview

So today I'll be talking about unlocking retrieval augmented generation for domain-specific LLM chatbots.

Personal Journey into AI

So before I go into that in a huge amount of detail, unsurprisingly, like a lot of us in the room right now, I kind of got into AI from a non-AI background.

So I used to be a neuroscientist and my background was in action observation. That's the domain in which I did my PhD.

So action observation concerns itself with how we learn from watching others and what brain regions are involved in basically letting us learn and understand what people are doing and potentially also then transforming those actions and being able to do those actions ourselves.

And so a big, big part of my PhD was working with fMRI data, and that was honestly the scariest way to get into machine learning. I would not recommend fMRI data as the first place to start when it comes to learning machine learning. But, you know, I jumped in at the deep end, and it worked out very well.

But once I was wrapping up my PhD, I realized that I was very attached to the methodology and a lot more curious about all the different practical applications of machine learning outside of just using it within a neuroscience context specifically.

Transition from Academia to Industry

So once I left academia,

Functional, functional. Yeah, I'll break that down.

So when you have your generic MRI scans, when you go to a hospital, a lot of us have still three-dimensional images that are captured for medical slash clinical scans. But what we do in research is take movies.

So we ask you to do things in a scanner. Sometimes it's watching things, sometimes it's doing things. And then we record movies of you doing those activities. And we look at the change and the patterns in the brain. So that's what the functional bit is about.

So anyway, so I left academia and I decided I wanted to pursue, I wanted to see what was out there in terms of being able to apply these skills in an industry kind of setting. And I attempted all of this right before the pandemic. So that was fun.

Pivot to NLP and Industry Application

During the pandemic, I was able to work with some small businesses and come up with different use cases for machine learning based on their specific needs. And during this period, I also had a lot of businesses say: we're working with a lot of unstructured text data, we have lots and lots of unstructured text data living in our company, we don't know how to do anything with it, can you help us? So that was my segue into NLP. So for a little while, I was doing work using natural language processing.

Very shortly thereafter, we had the ChatGPT slash LLM explosion. Suddenly all of the events in London that were about NLP became entirely about large language models, or most of them became about large language models. It was a very exciting time.

I started playing around with all of these different types of orchestration tools that were being released.

Building a Domain-Specific LLM Chatbot

And as I was doing all of this, as I was networking in the space, a biotech company approached me to build a chatbot, a chat application that can speak specifically to biotech documents. And I was fairly honest with them. I told them, listen, I've just been playing with these tools at a very small scale. I don't know if I can necessarily build this for you in a way that will be useful, something that can speak to very, very specific domain-specific data.

I'm happy to give it a shot. If you guys have the budget, I'm here. Let's do it. So essentially what happened was they said, okay, well, we have the budget, we like your pitch, why don't you go off and do this? And then I built them a product, a chat application prototype that can speak to their particular documents.

Now, I'm not going to talk about their use case today, but I will talk about something very similar that I built that speaks to documents from my research domain, which is action observation.

Again, just to go over the things that we want to keep in mind when we build domain-specific chat applications: terminology can be extremely, extremely specialist. And that terminology can be a big barrier to building something like this, because these documents are not going to look a lot like the generic data that you can scrape from the internet. They can be very, very specific.

So early on, I think when the transformer models were first released, around 2017 or so, there were companies in London that were getting on that train, a lot of legal tech firms that really wanted to utilize these models to create something that can use natural language to communicate with these kinds of documents.

Again, most of you in the room, I'm pretty sure, are familiar with hallucinations. So even now, when you're using ChatGPT or any of these proprietary slash open-source models, you can ask questions and very often you will get very, very convincing answers that aren't necessarily correct.

Oh my god. Yesterday, I was looking for code. I was working on this particular application, and I asked ChatGPT a question, and it gave me a very, very convincing code sample that was entirely fictional. So I'm sure many of you have come across similar things working with large language models.

Occasionally I come across posts online about how context windows for large language models are getting bigger and bigger, so the more data you can pass into a large language model, theoretically the easier it is to ask questions of very, very high-volume data. But at the same time, you want to be careful that, especially for proprietary models, that's a very expensive thing to do. So just because you can pass an entire book or an entire database into an LLM to ask questions, if you keep doing that repeatedly, it can be very, very costly.

Understanding Retrieval Augmented Generation (RAG)

That's where retrieval augmented generation comes in. So some of you might be familiar with this, but I will go over the basics of it.

So the idea is that you can connect the large language model of your choice to a specific data set. And this particular application that you build will answer questions just about the data that you want.

Now, I know sometimes people refer to this as sort of enhancing the LLM, or giving your LLM superpowers. But it's not quite that. It's actually kind of restricting the LLM. You're restricting the LLM to say: listen, I only want you to answer questions about the data that I'm giving you. I don't want you to answer questions about other things that are not pertinent to my use case.

So one of the ways that we do this is that we use vector databases to select only the context that can be used to answer specific questions. So if you're asking a question about a book, you don't pass the entire book. You basically use vector databases to select just the context that is relevant to the question that you're asking.

And it also saves you quite a bit in terms of the amount of tokens that you expend, the amount of tokens that you use when you're using specifically proprietary large language models. But if you're using open source language models, it can also kind of help with latency and whatnot in terms of just having an application that can process this information very quickly.

And another thing which is very interesting is that when you pass a very, very large body of data to a large language model, the middle of that body of data doesn't get as much attention as the stuff at the beginning and the stuff at the end. And actually, as a neuroscientist, I speculate that I know why this might be happening: this is a learnt behavior that LLMs pick up from how we humans tend to work with data ourselves. We have something called a primacy effect, where we recall the beginning of something, and something called a recency effect, where we remember the end of something, and we have fuzzy memories of the middle. And we apply this style of cognition to many, many things in our lives: the experiences that we have, the books we read, the movies we watch, the data that we work with, the research papers that we're reading, et cetera, et cetera.

So I would say it's a statistical behavior that LLMs are picking up. Completely speculative.

And one of the big advantages of being able to build something like this is that you can build it fairly rapidly with a lot of these new orchestration tools. The one that I've used is very popular; it's called LangChain. LangChain allows you to build these applications quite quickly and play around with your pipeline quite a bit, experiment with different types of ingestion pipelines, and you might get something that you can actually get some decent use out of.
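Just to give a feel for how little code that can be, here is a minimal sketch of that kind of pipeline in Python. This is not the code behind my demo: the file path and model names are placeholders, the local FAISS store stands in for a real vector database, and the exact imports and class names depend on which LangChain version you're using.

```python
# Minimal RAG sketch with LangChain (imports/class names vary by version).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Load and chunk a pre-processed .txt file (placeholder path).
docs = TextLoader("papers/action_observation.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and keep them in a small local vector store.
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Retrieve the most relevant chunks, then have the LLM answer from them.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.invoke({"query": "What is the action observation network?"})["result"])
```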

The Mechanics of RAG Explained

I'll try to go over this fairly quickly, but here is the overarching, hierarchical view of what's happening under the hood with the demo that I'm going to show you. So let's start here.

Document chunks: let's assume, for the sake of simplicity, that these are all lovely, pre-processed, perfectly readable .txt files that have already been split into chunks, that they are taken from research papers, and that they are converted to embeddings using something like OpenAI's embeddings model. So what are embeddings? Embeddings essentially are numerical representations of those chunks.
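To make that concrete, here is roughly what the embedding call looks like with the OpenAI Python SDK; the model name is just one of the current embedding models, and the two chunks are made up.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = [
    "The action observation network is active when we watch others move.",
    "Premotor and parietal cortices respond during both observation and execution.",
]

# Each chunk becomes a fixed-length vector of floats: its embedding.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # e.g. 2 chunks, 1536 dimensions each
```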

And those embeddings will be saved in a vector store. Why use embeddings? Embeddings let us index the similarity of documents to each other. So the embeddings for two documents that are fairly similar to each other should be vectors that are fairly close to each other.

So if you can imagine a three-dimensional space, the documents that are very similar to each other live close together, and documents that are very different live quite far apart. So that's how they live in our vector store.
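Under the hood, "close together" is just a number you can compute between two vectors, usually cosine similarity. A toy example with made-up three-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means pointing the same way (very similar), ~0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny made-up 3-dimensional "embeddings", just to show the idea.
observation_paper = np.array([0.9, 0.1, 0.2])
imitation_paper   = np.array([0.8, 0.2, 0.3])   # similar topic -> close
cookbook          = np.array([0.1, 0.9, 0.1])   # different topic -> far

print(cosine_similarity(observation_paper, imitation_paper))  # high
print(cosine_similarity(observation_paper, cookbook))         # low
```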

Now vector stores, they can be offline, they can be small, they can live on your laptop, they don't have to take up a lot of space, or they can be huge. They can be the size of the vector store that's powering Twitter's Grok. And there are lots of different vendors that provide vector storage services.

So the demo that I'm showing you today is based on a cloud vector store. It's not particularly big; in my case, it's only about 100 megabytes. But they can get very, very big.

So now we've covered our embeddings. Our documents are here. So now we have a question.

Somebody that uses my demo will ask a question about this particular field of research. That question also gets converted to embeddings, and those embeddings are used to run a semantic search that picks out, from all of the documents, the chunks most relevant to answering the question. Those chunks are then returned in order of relevance.
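As a rough sketch of that retrieval step, assuming a Qdrant cloud collection and the same OpenAI embedding model used for the documents (the URL, API key, collection name, and the "text" payload field are placeholders for however you stored your chunks):

```python
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="https://YOUR-CLUSTER.cloud.qdrant.io", api_key="...")  # placeholders

question = "What is the action observation network?"

# 1. Embed the question with the same model used for the document chunks.
q_vector = openai_client.embeddings.create(
    model="text-embedding-3-small", input=[question]
).data[0].embedding

# 2. Semantic search: pull back the most similar chunks, best first.
hits = qdrant.search(
    collection_name="action_observation_papers",  # placeholder collection name
    query_vector=q_vector,
    limit=5,
)
context_chunks = [hit.payload["text"] for hit in hits]  # assumes chunks were stored under "text"
```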

And then the LLM is basically told to assemble an answer from the best chunks. So here's a bunch of chunks, some of them are better than others, we have them in order: can the LLM construct an answer to the user's question based on those chunks? And then hopefully, hopefully, we get a nice pretty answer at the other end.
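Continuing the sketch above (reusing `openai_client`, `context_chunks`, and `question`), the "assemble an answer" step is mostly prompt construction: paste the retrieved chunks into the prompt and tell the model to answer only from them.

```python
prompt = (
    "Answer the question using ONLY the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n" + "\n\n".join(context_chunks) + "\n\n"
    f"Question: {question}"
)

answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
print(answer)
```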

All right, so hopefully that was clear.

Research Background of the Demo

Very, very quickly, just to give you an idea of what the research background is. So this is ultimately what the demo is designed to answer questions about.

Raise your hand if you're a social neuroscientist by any chance, if you're in this room right now. Because if you're not, then you'll just have to trust me when I say that, oh, the answer is good, the answer is bad, yada, yada, yada.

So the thrust of this is that we have these brain regions that come online quite a bit when you're observing other people. You don't have to remember these names: middle temporal gyrus, inferior parietal sulcus, and supplementary motor area. So those are kind of the big key terms.

And the main takeaway from the whole field is that some of the regions that come online, when you're also doing stuff, overlap with some of those observation regions. And that's the whole spiel. That's all it really is.

There's a lot of niche debates going on in this field. That's what the entire field is about. All right.

Demonstration of the RAG Chatbot

Now I will show you my little demo for my little research bot that can answer questions.

So here, this is a Streamlit application that I created. Let's refresh it for a bit.

I can start by connecting it to my Qdrant vector store, which is not particularly big. It's 100 megs.
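For what it's worth, the connection step in a Streamlit app can be as small as the sketch below. This is not my actual demo code; the secrets keys and all the wording are placeholders.

```python
import streamlit as st
from qdrant_client import QdrantClient

st.title("Action observation research bot")

# Connect to the (small, ~100 MB) Qdrant cloud collection on demand.
if st.button("Connect to vector store"):
    st.session_state.qdrant = QdrantClient(
        url=st.secrets["QDRANT_URL"],       # placeholder secret names
        api_key=st.secrets["QDRANT_API_KEY"],
    )
    st.success("Connected to Qdrant.")

question = st.text_input("Ask a question about the action observation literature")
if question and "qdrant" in st.session_state:
    st.write("…retrieval and answer generation would run here…")
```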

And let's ask it a question. So the first question I will ask is, what is the action observation network? And so it should be using, yeah, here we go. So now we've got this answer here. And let's see what it has to say.

So: the action observation network is a system in the brain that is active both when individuals observe actions being performed, yada, yada, and when they perform actions themselves. Then we have mention of the bilateral premotor and parietal cortices. That's all correct. It allows individuals to internally simulate observed actions with their own sensorimotor system.

So that's a fantastic summary. It's also highly debated, but it is a great summary of the literature that's there. The AON is modulated by one's expertise. That is true. So based on your level of experience with movements, with what you're watching, with what you're seeing, that can modulate the activity in the region as well. So this is a fantastic summary of an answer.

Let's try something a little bit more abstract. So again, this chatbot is talking to research papers. These are very, very dense. They have a lot of terminology. And we're trying to use ChatGPT to extract some pretty niche information. So let's see if it can answer a question like: what are the statistically significant findings in the domain of the action observation network?

Now, the reason I'm asking this question is that in order to understand what statistical significance is, you should theoretically have gone through quite a few classes on statistics. It's a very, very specific understanding of statistical significance that helps you to parse through these papers. So some of you in the room probably are familiar with it, but I won't go over it in too much depth.

Thinking, thinking, here we go. All right, so statistically significant findings in the domain of the action observation network include consistent activation in the bilateral premotor and parietal cortices. That is true; that's the AON. And it goes into specific depth about how activation in Broca's area (BA44, BA45) differs between observation and imitation tasks. That's a key finding. So we're not saying that the level of activation is the same for both things; we're saying that there's some difference there. Then we have some details about imitation recruiting the caudal ventral part, BA44. Additionally, the posterior middle temporal cortex. If one of my undergrads from back in the day were to put this together, this would easily be an A-grade summary of what's happening in this field. So I would give this a fairly good grade in terms of how it's answering questions.

Capacity and Expansion Plans

Yes, please. So this is a small one. This is five big research papers. The main use case that I built was 1,500 documents. But that's because I wasn't paying for the vector storage. So this one I'm keeping quite small because it allows me to use the Qdrant free tier, which is one gigabyte. And as far as I can tell, I could probably take it up to about 40 to 50 PDFs under the free-tier 1 GB cloud storage.

Integrating Multimodal Models

Sure, that's a very good question. That's a fantastic question, and that is what I want to do next. So there are all these multimodal models now. There are pipelines that I believe Haystack, and probably also LangChain now, have that take in images, keep track of where they appear within the context, pass the images to ChatGPT or another LLM to annotate what's in the image, and then add the text annotation back into the body of the text and process the whole thing. Here, of course, if I had time, I would have done that. But the good thing about research papers is that you won't find images that are standalone. There are never any images that are standalone: they will always have a big chunk of annotation, and they'll always be discussed in the body of the text. You can't let pictures hang in the air.
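As a sketch of that idea (not something built into this demo), you could caption each figure with a vision-capable chat model and splice the caption back into the text before chunking; the model name and file path here are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_figure(image_path: str) -> str:
    """Ask a vision-capable model to describe a figure so the description
    can be spliced into the surrounding text before chunking/embedding."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this research figure in two sentences."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The caption then gets inserted into the document text where the figure appeared.
figure_caption = describe_figure("figures/fig2_aon_activation.png")  # placeholder path
```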

So anyway. OK, so those are questions that are relevant. Let's see what it does if I ask very silly things. OK, let's try.

Testing the Chatbot with Unconventional Questions

So what is the air speed velocity of an unladen swallow? Anyone? It's a reference to Monty Python, is what it is. It doesn't know.

It doesn't know. So it's not that ChatGPT doesn't know. If you ask this question of ChatGPT directly, it'll probably give you an answer referencing Monty Python, yada, yada, some velocity. It'll come up with some scraped information from the internet.

But here, we clearly have a case of, oh, well, it's not in the data set. It's not going to be discussed.

Red Teaming: Testing the Limits of the RAG Chatbot

Let's try again. Let's try something that's called red teaming, where you ask questions of your RAG pipeline to see if it can give you back answers like: OK, I don't know, the answer to this is not in the context.

So this is called red teaming. So let's say I ask, what is the role of the thalamus in the action observation network? So that's still a brain region. It is definitely active, but it doesn't play a specific role.

So here we have the answer, or the answer that it's going to give us based on the documents that we provided, which is: the thalamus is not explicitly mentioned in the provided context regarding the AON. Remember, these papers are dense. There are hundreds of brain regions, very specific regions, that are indexed all the time, and the thalamus may have cropped up somewhere just in terms of situating some kind of activation. So this is territory where you could absolutely get a hallucination if you didn't have these guardrails in place. So I would say this would be a handy tool: if you are working with RAG, if you are working with very domain-specific papers, you can build something like this.
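If you want to make that checking a bit more systematic, a small red-team harness can run a list of out-of-scope questions against the pipeline and flag anything that doesn't decline. Here `ask_rag` is a hypothetical stand-in for whatever chain you've built, and the refusal markers depend on how your prompt tells the model to decline.

```python
# Hypothetical red-team harness: ask_rag() stands in for your RAG pipeline.
REFUSAL_MARKERS = ["not mentioned in the provided context", "i don't know"]

red_team_questions = [
    "What is the airspeed velocity of an unladen swallow?",
    "What is the role of the thalamus in the action observation network?",
]

def looks_grounded(answer: str) -> bool:
    # A crude check: did the bot admit the answer isn't in its documents?
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

for q in red_team_questions:
    answer = ask_rag(q)  # your retrieval + generation pipeline (hypothetical)
    status = "OK (declined)" if looks_grounded(answer) else "CHECK (possible hallucination)"
    print(f"{status}: {q}")
```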

And I would encourage you to spend quite a bit of time making sure that it's behaving the way that you want it to. But yeah, for my purposes, this works pretty darn well. And I just wish I had it in 2013 when I was doing my PhD, and I didn't, and I had to read everything from scratch.

It was really bad. But yeah, that's the demo. Thank you so much.
