In this project I will develop a chatbot that generates SQL answers. Alongside it, we will perform a finetuning with the goal of reducing token consumption and improving the quality of the results. All of this could be done more simply on a platform such as Cohere or OpenAI, but in my case I will use Lamini, because it gives us the option of finetuning other LLMs such as Llama 3.
Disclaimer: In this project I will use the free tier of Lamini to keep it accessible to anyone interested in experimenting with finetuning. If you need a production-ready app, I recommend using other platforms or upgrading to the paid tier to access more resources, such as memory tuning, which keeps hallucinations close to zero and returns results faster.
Description
As mentioned before, the idea is to create a chatbot that generates SQL answers. For the frontend interface I will use Streamlit, as it's currently the easiest way of creating a web app. In this project I will build a simple chat app without semantic routers or filters; for a more complex chat app I recommend reading my article about chatbots with sources and filters.
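As a rough idea of what the frontend could look like, here is a minimal sketch of a Streamlit chat interface wired to a Lamini model. This is only a sketch under a few assumptions: the model name is a placeholder (you would swap in the finetuned model once training is done), and depending on the SDK version the response may come back as a dict with a "Response" key.
import lamini
import streamlit as st

lamini.api_key = "YOUR API KEY"
# Placeholder model: swap in the finetuned model ID once training is done
llm = lamini.Lamini("EleutherAI/pythia-410m")

st.title("Text-to-SQL Chatbot")

# Basic chat loop: read a question, echo it, and show the model's answer
if question := st.chat_input("Ask me to write an SQL query"):
    st.chat_message("user").write(question)
    # The answer may be a dict with a "Response" key depending on the SDK version
    answer = llm.generate(question, output_type={"Response": "str"})
    st.chat_message("assistant").write(answer)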
There are multiple types of finetuning we can do (copilot, routing, etc.). In this case, because the idea is to have a chatbot, we will perform instruction finetuning, the type of finetuning that teaches the model to follow instructions. An easy way of understanding this is with OpenAI: instruction finetuning is what turns GPT-4o (the brain) into ChatGPT.
In this project, we are going to take a model of 410 million parameters and check whether the results it provides can be as good as those of a model with far more parameters, like GPT-4o.
Before running the finetuning, let's run the code without any training and see the output:
import lamini
lamini.api_key = "YOUR API KEY"
llm = lamini.Lamini("EleutherAI/pythia-70m")
print(llm.generate("How can I retrieve the names of employees who have the highest average performance score in their department and also attended at least three training sessions in SQL?", output_type={"Response":"str"}))
Which gives back the following response:
I’m not sure how to get the names of employees who have the highest average performance score in their department and also attended at least three training sessions in SQL?
This response is far from what we are looking for; let's see what we get with finetuning.
Data Preparation
When working with LLMs it's important to have good-quality data. In this case I will use synthetic data, which basically means it is generated by another LLM more powerful than the one we are finetuning. The data will be Text-to-SQL. There are also multiple ready-made datasets available in places like Hugging Face, in case we don't want to use our own data or generate a synthetic one.
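For example, instead of generating data we could pull an existing Text-to-SQL dataset from the Hub. The dataset name below is just one public option I believe is available; its column names differ from the input/output format we use later, so they would need remapping.
from datasets import load_dataset

# Example only: a public Text-to-SQL dataset (name and columns may change over time)
sql_dataset = load_dataset("b-mc2/sql-create-context", split="train")
print(sql_dataset[0])  # inspect the columns before remapping them to "input"/"output"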
The data has to follow a particular structure depending on the platform we are working on; most of the time it is JSON data with specific keys. In the case of Lamini, each example is a JSON object with an "input" and an "output" key:
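{
  "input": "How do I select all columns from a table named 'employees'?",
  "output": "SELECT * FROM employees;"
}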
Now let’s see the steps we need to follow to prepare this data:
- Collect the question-answer pairs: This is the information about what the user is asking and the expected response.
- Tokenize: This means turning the data into numerical values for the training; we will see this in the next step.
- Split train/test: We want to keep some test data to evaluate how good our model is.
1. Question-Answer pairs
As mentioned before, I will generate the data using a more powerful LLM, GPT-4o via ChatGPT, and create 100 rows of data that we will use for the training.
This is the prompt I used; feel free to create your own:
I'm performing fine-tuning for an LLM and I need the data to follow a certain
structure. The idea is that the user will introduce a question about what they
want to do, and the LLM will generate an SQL response. Can you generate multiple
examples of simple and complex SQL queries that follow that structure?
Please make around 100 examples.
This is the structure:
{
"input": "How do I select all columns from a table named 'employees'?",
"output": "SELECT * FROM employees;",
},
2. Tokenize
In simple words, tokenizing means splitting a sentence into sets of characters (tokens) and turning them into numbers. This is what the LLM actually consumes to generate a response, and it's also why, when we use the wrong tokenizer, our LLM gets really "confused".
Before continuing let’s see what an encoding looks like to explain an important concept.
To tokenize in a really simple way we will use the transformers library, which can find the right tokenizer for our model:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
# Text we want to encode
text_1 = "Hello, hope you are having a great day!"
text_2 = "Today"
# We encode the text
encoded_text_1 = tokenizer(text_1)
encoded_text_2 = tokenizer(text_2)
# We print the output
print(f"The ecoding for the first text is: {encoded_text_1["input_ids"]}")
print(f"The ecoding for the second text is: {encoded_text_2["input_ids"]}")
Which gives us the following response:
- The encoding for the first text is: [12092, 13, 3524, 368, 403, 1907, 247, 1270, 1388, 2]
- The encoding for the second text is: [14569]
As we can see, we get the encoding quickly for any specific model, but there is a problem: when we pass the encodings to our LLM for training, they need to have the same length. The first sentence is longer than the second, which means more tokens, but because we are working with tensors, we need the same number of "tokens" in every example.
There is a solution for this called padding tokens, which makes all the sequences the same length. Let's implement it:
First, we need to specify the token we want to use as padding. There are multiple ways of doing it; in this case we use .eos_token, whose ID for this tokenizer is 0, so the shorter sequences get padded with 0s.
tokenizer.pad_token = tokenizer.eos_token
Now we can just continue tokenizing, this time with the extra padding argument:
# We pass the texts in a list for the method to work
list_texts = ["Hello, hope you are having a great day!", "Today"]
encoded_texts_longest = tokenizer(list_texts, padding=True)
If we print the result now, we get:
- The encoding for the first text is: [12092, 13, 3524, 368, 403, 1907, 247, 1270, 1388, 2]
- The encoding for the second text is: [14569, 0, 0, 0, 0, 0, 0, 0, 0, 0]
With all of this done, we just have to address another important concept called truncation: our model has a limit on the number of tokens it can process. To avoid exceeding that limit we will use the truncation option from this library, combine it with padding, and we are ready to continue:
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
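If we print the padded-and-truncated result, both sequences should now have a length of three, with the first one cut down and the second one padded up:
print(encoded_texts_both["input_ids"])
# Expected (first text truncated to 3 tokens, second padded up to 3):
# [[12092, 13, 3524], [14569, 0, 0]]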
We can do all of this for our own data. Let's take the first example and walk through the process:
example_data = {"input": "How do I select all columns from a table named 'employees'?",
"output": "SELECT * FROM employees;"}
text = example_data["input"] + example_data["output"]
tokenized_inputs = tokenizer(
text,
return_tensors="np",
padding=True
)
print(tokenized_inputs["input_ids"])
We get:
[[ 2347 513 309 3609 512 9930 432 247 2829 4907 686 10221 4317 46146 12995 475 10727 6171 28]]
Now we should check if we need truncation; to do that, we define the maximum length of the model we are going to use:
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)
Now let's tokenize our dataset. First, let's create a function to tokenize the file:
# This function is a modified version of the one created by Lamini
def tokenize_function(dataset):
    text = dataset["input"][0] + dataset["output"][0]

    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs
Now let’s tokenize the file:
filename = "sql_data.json"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()
finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")
tokenized_dataset = finetuning_dataset_loaded.map(
tokenize_function,
batched=True,
batch_size=1,
drop_last_batch=True
)
print(tokenized_dataset)
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])
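The extra labels column simply copies input_ids; in causal-language-model training the shifting of the labels happens inside the model, so inputs and labels can be identical.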
The only thing left is to split the dataset into training and test sets; with around 100 rows and a test size of 0.1, that leaves roughly 90 examples for training and 10 for evaluation:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
Finetuning
Once the data is ready, we just need to start training our model. We are going to use the Lamini Python SDK, so we don't need to worry about encoding and decoding unless we want to do the inference manually. The data used for the training is available in the GitHub repo.
from lamini import Lamini
import json

llm = Lamini(model_name="EleutherAI/pythia-410m")

file_path = '/content/sql_data.json'
with open(file_path, 'r') as f:
    json_array = json.load(f)

dataset = list(json_array)

llm.train(
    data_or_dataset_id=dataset,
    finetune_args={
        "max_steps": 300,
        "early_stopping": False,
        "load_best_model_at_end": False,
    },
)
When we start the training, Lamini will give us a dataset_id, which we can reuse in the future if we want to run another training on the same data. In the Lamini dashboard we can also check how much time remains until the training is done.
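For example, a later run could pass that ID instead of re-uploading the data. The ID below is a placeholder, and I'm assuming the SDK accepts it directly, as the parameter name suggests:
# Reuse the data from a previous run by passing its dataset_id
llm.train(
    data_or_dataset_id="YOUR_DATASET_ID",  # placeholder: the ID returned by the first run
    finetune_args={
        "max_steps": 300,
    },
)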
Results
Once the training is completed, let's check the performance with the evaluation rows.
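To run these evaluations we just point the client at the finetuned model; the model ID below is a placeholder for the one Lamini shows once training finishes:
from lamini import Lamini

# Placeholder: replace with the finetuned model ID from the Lamini dashboard
llm = Lamini(model_name="YOUR_FINETUNED_MODEL_ID")
print(llm.generate(
    "How can I create a trigger that logs changes to the 'salaries' table?",
    output_type={"Response": "str"},
))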
Query: How can I create a trigger that logs changes to the ‘salaries’ table?
Expected output
CREATE TRIGGER log_salary_changes AFTER UPDATE ON salaries FOR EACH ROW BEGIN INSERT INTO salary_changes_log (employee_id, old_salary, new_salary, change_date) VALUES (OLD.employee_id, OLD.salary, NEW.salary, NOW()); END;
Base Model
I'm not sure how to get the names of employees who have the highest average performance score in their department and also attended at least three training sessions in SQL?
Trained Model
CREATE TRIGGER log_salary_changes AFTER UPDATE ON salaries FOR EACH ROW BEGIN INSERT INTO salary_changes_log (employee_id, old_salary, new_salary, change_date) VALUES (OLD.employee_id, OLD.salary, NEW.salary, NOW()); END;
GPT-4o (instructed to only provide the SQL response)
-- Create the logging table
CREATE TABLE salary_changes_log (log_id INT AUTO_INCREMENT PRIMARY KEY,operation_type VARCHAR(10),employee_id INT,old_salary DECIMAL(10, 2),new_salary DECIMAL(10, 2),change_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
-- Trigger for INSERT
CREATE TRIGGER log_salary_insert AFTER INSERT ON salaries FOR EACH ROW BEGIN INSERT INTO salary_changes_log (operation_type, employee_id, old_salary, new_salary) VALUES ('INSERT', NEW.employee_id, NULL, NEW.salary); END;
-- Trigger for UPDATE
CREATE TRIGGER log_salary_update AFTER UPDATE ON salaries FOR EACH ROW BEGIN INSERT INTO salary_changes_log (operation_type, employee_id, old_salary, new_salary) VALUES ('UPDATE', NEW.employee_id, OLD.salary, NEW.salary); END;
-- Trigger for DELETE
CREATE TRIGGER log_salary_delete AFTER DELETE ON salaries FOR EACH ROW BEGIN INSERT INTO salary_changes_log (operation_type, employee_id, old_salary, new_salary) VALUES ('DELETE', OLD.employee_id, OLD.salary, NULL); END;
As we can see, the improvement in the model is substantial: we now get a result that matches what we are looking for, with only 100 rows and a simple finetuning. This can be improved further by using a bigger model, more data, and more advanced techniques such as QLoRA or Memory Tuning.
Thanks for reading!