In this Python project, we will download diabetes information from MedlinePlus, a reputable website run by the United States government. The downloaded documents cover a variety of content around diabetes, such as descriptions, symptoms, and related conditions.

All of that data will be vectorized and stored in a vector database, Pinecone, in order to perform semantic search over it.

Disclaimer: This project is intended for educational and exploratory purposes only. It aims to facilitate data analysis and to illustrate potential future applications of AI in medicine. Please note that I am not a medical professional, and the outputs generated by this project should not be considered medical advice. For any medical concerns or decisions related to diagnosis, treatment, or care, it is crucial to consult a qualified medical provider or healthcare professional. They can provide personalized guidance and make informed decisions based on your specific medical history and needs.


Description

This project consists of a vector database hosted in Pinecone, whose vectors contain information on diabetes. We will use those vectors to perform retrieval-augmented generation (RAG) to feed our LLM, in this case Cohere Command R.

First, we will download the information, then chunk it, vectorize it, and upsert the data into our Pinecone index.

This project is centered around the creation of the vector DB and a simple Cohere-based RAG setup, so the data will not be deeply cleaned.


Installing

The first step is to install all the libraries needed for the project:

!pip install -qU langchain langchain_community tiktoken \
    unstructured tqdm requests bs4 cohere pinecone-client


Importing

During the project we will need a series of libraries; we will import all of them at the beginning.

# Web Download
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import os

# Langchain and tokenizer
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

# Progress bar
from tqdm.auto import tqdm

# Data management, control and hashing
import hashlib
import json
import time

# AI model
import cohere

# Vector DB
from pinecone import Pinecone, ServerlessSpec

Data Processing

The first thing we need to do in this step is download the data from MedlinePlus. There are multiple ways of doing this; in this case, because MedlinePlus has a good structure and organization, if we open the page for diabetes we can see it contains a list of further pages on the topic.

My approach will be to pass the link for diabetes and download the content of all the pages linked from it. For that, we will create a function.

In my case, I’m working in Google Colab, so I will choose a specific directory for this project.

def download_html_pages(url):
	# Fetch the HTML content of the given URL
	response = requests.get(url)
	
	if response.status_code == 200:
		
		# Parse the HTML
		soup = BeautifulSoup(response.text, 'html.parser')

		# Find all anchor tags (links) in the HTML
		links = soup.find_all('a', href=True)
		
		# Create a directory to store downloaded pages
		save_path = '/content/downloaded_pages' 
		
		if not os.path.exists(save_path):
			os.makedirs(save_path)
			
		# Filter and download HTML pages
		for link in links:
			href = link['href']
			
			# Check if it's a relative URL
			if urlparse(href).scheme == '':
				href = urljoin(url, href)
			
			# Download only HTML pages
			if href.endswith('.html'):
				# Get the filename
				filename = href.split('/')[-1]
				
				# Download the HTML page
				page_response = requests.get(href)
				
				if page_response.status_code == 200:
					# Save the HTML content to a file in Colab's temporary space
					with open(os.path.join(save_path, filename), 'wb') as f:
						f.write(page_response.content)
				else:
					print(f"Failed to download: {href}")

	else:
		print("Failed to fetch the page.")

Once we have the function we can download the content from the link.

download_html_pages("https://medlineplus.gov/diabetesmellitus.html")

With all the data in the directory, we can proceed to load it into documents using LangChain. The loading can be performed in different ways; as stated before, this project is focused on data loading and retrieval, so I will not focus on cleaning the data.

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('downloaded_pages', glob="**/*.html")
docs = loader.load()

Now we can check the number of documents that we have; in this case, the output is 29.

len(docs)

Each doc begins with a header that does not contain useful information, so I will eliminate the first 1300 characters.

for doc in docs:
	doc.page_content = doc.page_content[1300:]

Once we have this ready, we can start chunking the data to vectorize it. The idea is to chunk in a logical manner; for that, we need to consider the context window of the LLM and the number of documents we want to retrieve.

Example of chunking: Let’s assume we have an LLM with a context window of 5000 tokens. In that window we have to fit both the query and the retrieved documents. If we reserve 2000 tokens for the query and instructions for the LLM, that leaves 3000 tokens for the retrieved documents; returning 5 documents means each one can be at most 3000/5 = 600 tokens, so we would need to chunk at 600 tokens.
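The arithmetic above can be sketched as a small helper. This is only an illustration of the budgeting logic; the function name and the numbers are just the example values from the text, not fixed parameters:

```python
def max_chunk_tokens(context_window: int, reserved_for_query: int, top_k: int) -> int:
    """Return the largest chunk size (in tokens) that lets top_k
    retrieved chunks fit alongside the query in the context window."""
    available = context_window - reserved_for_query
    return available // top_k

# The example from the text: 5000-token window, 2000 reserved, 5 documents
print(max_chunk_tokens(5000, 2000, 5))  # 600
```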

In this case, the LLM is Cohere Command R; according to the documentation, the model has a context window of 128k tokens, so we have plenty of space for retrieval. The chunk size in this case will be 1k tokens.

First, we will create a function to get an idea of the tokens that each document has.

import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

def token_length(text: str):
	tokens = tokenizer.encode(text, disallowed_special=())
	
	return len(tokens)

With this, we can get the average, max, and min of the tokens per document.

token_counts = [token_length(doc.page_content) for doc in docs]

print(f'Min length: {min(token_counts)}')
print(f'Avg length: {round(sum(token_counts)/len(token_counts))}')
print(f'Max length: {max(token_counts)}')

We get:

  • The minimum length is 143.
  • The maximum length is 7736.
  • The average length is 1811.

Now we can do the chunking:

text_splitter = RecursiveCharacterTextSplitter(
	chunk_size = 1000,
	chunk_overlap = 50,
	length_function = token_length,
	separators = ['\n\n', '\n', ' ', '']
)

For each chunk, we will create a unique ID. To facilitate updates, deletions, and identification, the idea is to derive an ID from each file's URL (via an MD5 hash) and append the chunk's integer position within the file.

For example, if the first document hashes to 5e66683f and we divide it into 3 chunks, the IDs will be:

  • 5e66683f-0
  • 5e66683f-1
  • 5e66683f-2

documents = []

for doc in tqdm(docs):
	url = doc.metadata['source'].replace('downloaded_pages/', 'https://medlineplus.gov/')
	
	# Hash each URL independently so the ID is deterministic per file
	uid = hashlib.md5(url.encode('utf-8')).hexdigest()
	chunks = text_splitter.split_text(doc.page_content)
	
	for i, chunk in enumerate(chunks):
		documents.append({
			'id': f'{uid}-{i}',
			'metadata': {
				'url': url,
				'text': chunk
			}
		})

Once we have chunked the documents, we can check the number of chunks that we have, which is 73.


Vector DB

For the vector DB we need to convert each chunk into a vector to be able to perform retrieval. Pinecone can also store metadata alongside each vector; that is where we will keep the chunk text that will be passed to the LLM.

First, we have to initialize the Pinecone connection and then create an index. When creating the index, a critical setting is the dimension; to know which number to use, we have to check the embedding model we are using. In this case, embed-english-v3.0 from Cohere outputs vectors of dimension 1024.

pc = Pinecone(api_key='PINECONE_API_KEY')

index_name = 'diabetes-rag'  # choose any name for the index

# We check if the index has already been created
if index_name not in pc.list_indexes().names():
	pc.create_index(
		name=index_name,
		dimension=1024,
		metric="cosine",
		spec=ServerlessSpec(
			cloud='aws',
			region='us-east-1'
		)
	)
	
	# Waiting until it's created
	while not pc.describe_index(index_name).status['ready']:
		time.sleep(1)

index = pc.Index(index_name)

When this is initialized we can get the info of the index.

index.describe_index_stats()

There are different ways of uploading the data into the vector DB. In this case, we will build a series of lists with the data, embed the chunk texts with Cohere, and zip everything into (id, vector, metadata) tuples.

co = cohere.Client('COHERE_API_KEY')

content = []
ids = []
metadata = []

for i in range(len(documents)):
	ids.append(documents[i]["id"])
	metadata.append(documents[i]["metadata"])
	content.append(metadata[i]['text'])

# Embed the chunk texts with the same model the index was sized for
response = co.embed(
	texts=content,
	model='embed-english-v3.0',
	input_type='search_document'
).embeddings

to_upsert = list(zip(ids, response, metadata))

index.upsert(vectors=to_upsert)
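For larger datasets, upserting everything in a single request can run into Pinecone's request-size limits, so it is common to upsert in batches. A minimal sketch of the batching side, assuming `index` and `to_upsert` exist as above (the batch size of 100 is an arbitrary, commonly used value):

```python
def batched(items, batch_size=100):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# With a live index, the upsert loop would look like:
# for batch in batched(to_upsert):
#     index.upsert(vectors=batch)

# Quick sanity check of the batching helper on its own:
sizes = [len(b) for b in batched(list(range(250)))]
print(sizes)  # [100, 100, 50]
```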

Once we have this created, we can write a function to query the docs. First, we vectorize the query, then look for the documents with the highest similarity, and extract the chunk content along with its rank.

def get_docs(query: str, top_k: int):
	# Embed the query with the matching query input type
	xq = co.embed(
		texts=[query],
		model='embed-english-v3.0',
		input_type='search_query'
	).embeddings[0]
	res = index.query(vector=xq, top_k=top_k, include_metadata=True)
	docs = {x['metadata']['text']: i for i, x in enumerate(res['matches'])}
	
	return docs

Now we can query an example using the function, getting the top five results:

query = 'What is diabetes'
docs = get_docs(query, top_k=5)


LLM Implementation

Once we have all the structure of the vector DB, we can proceed with the whole pipeline: the user submits a query, we retrieve the most similar documents from the vector DB, pass those documents to the LLM, and get a response.

Cohere makes RAG simple: the retrieved documents are passed as just another parameter of the chat call, so we only need to adapt the format and run the request. We will run it twice, once with RAG and once without.

First, we perform the search in the vector DB:

query = 'What is diabetes'
docs = get_docs(query, top_k=5)

Now we structure the data:

documents_RAG = [
	{'title': str(value), 'snippet': key}
	for key, value in docs.items()
]
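To make the expected shape concrete, here is the same transformation applied to a toy retrieval result; the texts are invented placeholders, not real chunks from the index:

```python
# A toy version of what get_docs returns: {chunk_text: rank}
docs = {
    'Diabetes is a disease in which blood glucose is too high.': 0,
    'Type 2 diabetes is the most common type.': 1,
}

# Same list comprehension as above: rank becomes the title,
# the chunk text becomes the snippet
documents_RAG = [
    {'title': str(value), 'snippet': key}
    for key, value in docs.items()
]

print(documents_RAG[0])
# {'title': '0', 'snippet': 'Diabetes is a disease in which blood glucose is too high.'}
```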

Now we can get the output using RAG:

response = co.chat(
	model="command-r",
	message="What is Diabetes?",
	documents=documents_RAG
)

If we want to repeat the same without RAG, we just have to remove the `documents` parameter.

response = co.chat(
	model="command-r",
	message="What is Diabetes?",
)

Once we have that output, we can print it:

print(response.text)

Response with RAG: Diabetes is a disease in which the blood glucose levels are too high. Glucose is the body’s primary source of energy, which comes from the food that we eat.

In a person with diabetes, the body doesn’t make enough insulin - a hormone that helps glucose enter the cells - or doesn’t use it well. This causes the glucose to build up in the blood.

There are two types of Diabetes:

  1. Type 1 Diabetes: The body does not produce insulin.

  2. Type 2 Diabetes: The body does not produce or use insulin well. This is the most common type of diabetes. It is often associated with older adults, obesity, lack of physical activity, and family history.

People with diabetes need to monitor their blood sugar levels regularly and keep them at the right level, or they can develop health problems, including:

  • Eye disease
  • Foot problems
  • Gum disease
  • Heart disease and stroke
  • Kidney disease
  • Nerve problems
  • Sexual and bladder problems
  • Skin conditions

Diabetics also need to watch out for hyperglycemia (high blood sugar) and hypoglycemia (low blood sugar). Hyperglycemia symptoms include increased thirst and urination, blurred vision, and headaches, while hypoglycemia can cause dizziness, a fast heartbeat, and sweating.

Response without RAG: Diabetes is a chronic condition that impacts the body’s ability to regulate blood sugar levels. It’s characterized by either the body not producing enough insulin or the body’s cells not responding effectively to the insulin that is produced. Insulin is a hormone that helps transport glucose from the bloodstream into the body’s cells, where it’s used as energy.

There are two main types of Diabetes:

  1. Type 1 Diabetes: The body does not produce insulin. This form of diabetes usually appears during childhood or adolescence, but can also occur in adults. People with Type 1 Diabetes require daily insulin injections or infusions to survive.

  2. Type 2 Diabetes: The body either doesn’t produce enough insulin or doesn’t use it effectively. This type is often associated with lifestyle and can be managed with a healthy diet, regular exercise, and in some cases, medication or insulin therapy. It’s usually diagnosed in adults who are overweight, have a family history of diabetes, or have a history of insulin resistance, but can also occur in children if there’s a strong family history.

There are also other types of diabetes, including Gestational Diabetes, which some women experience during pregnancy, and secondary forms of diabetes that can be caused by certain medications, genetic disorders, or other health conditions.

Diabetes can cause various complications such as cardiovascular disease, kidney damage, nerve damage, and vision problems. Therefore, it’s important to seek medical advice and follow a treatment plan if you’re diagnosed with the condition.

In the RAG response, the model included some related conditions drawn from the retrieved documents. The improvement is small because the dataset is small and has not been cleaned.

For the RAG response, we can also extract the citations:

response.citations


Next steps

This project was a brief introduction to creating a vector DB, but the result can be improved by adding more steps:

  • Cleaning the data and improving the preparation
  • Adding more data to have more context
  • Implementing a reranking system to get better results
  • Using a semantic router to speed up the process