In this project I will create a Python class that takes advantage of filtering to reduce costs; on top of that, I will add a moderation layer using semantic router.

Github Repository


Description

As explained before, we want to reduce the cost of our AI generation, and to do that we will use filtering. But what exactly is this? When we have GenAI in production, a common situation is that we are using a mighty LLM, and that costs us money. If the user asks a simple question, one that does not require so much power, we might prefer to have a smaller and therefore cheaper LLM answer it.

There are paid services such as Martian that offer this, but the idea here is to use something free and scalable for our product.

In this case, we are going to use RouteLLM, an open-source alternative that is easy to use. In its announcement blog post, the authors claim cost reductions of up to 85% while keeping 95% of GPT-4 performance, compared to using GPT-4 alone. Not only that, but in the same article they compare this to the offerings from paid competitors, and RouteLLM achieves similar performance.

Combined with all of this, I'm going to create a class that integrates this routing system with a moderation filter using semantic router, so that we have something on a good path to being production-ready. If we want to push the cost reduction even further, we can use semantic router not only for moderation but also for RAG.

I have built a semantic router project before for a chatbot; in this case the code will be different, as this is not a chatbot, and it will be adapted to be as scalable as possible.

Let’s begin with the architecture of how this will work and the flow:

  1. The user introduces a query for the LLM. The first layer, built with semantic router, checks whether the query is offensive: if it is, it returns an output indicating that; if it's not, we move to the next step.

  2. In this step we pass the query to our routing model, which has been calibrated beforehand, and depending on that calibration the query is answered by the smaller model (weak model) or the big one (strong model). (A minimal sketch of this flow follows the list.)
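
Below is a minimal sketch of that flow in Python. The names used here (moderation_layer, router_client) and the threshold value are placeholders for illustration; the real implementation is built step by step through the rest of the article.

def answer(query: str) -> str:
    # Layer 1: moderation with semantic router.
    # A matched route means the query hits one of the blocked topics.
    if moderation_layer(query).name is not None:
        return "Query Rejected"

    # Layer 2: RouteLLM decides whether the weak or the strong model answers.
    response = router_client.chat.completions.create(
        model="router-mf-0.12",  # "router-<router>-<threshold>", explained later
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content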

For this, I will be using one of the latest open-source models, Llama 3.1, in both its 8B and 405B parameter sizes, in their instruct versions.

These two models require a lot of compute, so instead of running them locally I will use NVIDIA NIM, which provides a simple way of calling them. Also, at the end of the day I want to create a class that is as broad as possible, so to make the code compatible with multiple platforms by changing just a couple of lines, I will use LiteLLM. This library is easy to use and implement, and it is compatible with most providers, such as OpenAI, Anthropic, Hugging Face and so on.

So let’s recap the necessary packages and usages:

  1. Semantic-Router: To build a moderation layer, checking the user's input in case it violates our policies.
  2. RouteLLM: To choose which LLM should answer the question.
  3. LiteLLM: To make it easy to use any LLM that we want.

With this clear, let’s begin.

Installing and importing

Before starting any project, we install the libraries and import them, so let's get that out of the way:

pip install -q litellm "routellm[serve,eval]" openai semantic-router

Now we do the importing:

import openai
import os
from semantic_router import Route
from semantic_router.encoders import OpenAIEncoder
from semantic_router.layer import RouteLayer
from routellm.controller import Controller
from litellm import completion 
from litellm.types.utils import Usage, ModelResponse

We can also define some of the API keys that we are going to use later on:

os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"
os.environ['NVIDIA_NIM_API_KEY'] = "<NVIDIA_API_KEY>"

Step 0: Class design

This is the easiest step: we start designing the class, adding an __init__ function:

class Route_layer:
    def __init__(self):
        pass

Step 1: Moderation Layer

In this layer we define the routes that are not allowed. Semantic router was explained in my previous article, but in summary: we define a series of sentences or examples that should trigger a specific action; when the user writes a query, the predefined routes check whether the query can be catalogued into one of them, returning the name of the route (if any) that the query should be assigned to and its proximity to the defined sentences.

Before starting to design the class, let's see how the semantic router looks for a single route. In this case we will work with a confidentiality moderation route: if the user wants to access confidential information, this route should be triggered and the query not answered.

First we need a series of examples of confidentiality queries; I asked GPT-4o to generate some:

confidential_sentences = [
    "My social security number is 123-45-6789.",
    "The password for my account is Password123.",
    "Her phone number is (555) 123-4567.",
    "His email address is john.doe@example.com.",
    "The bank account number is 9876543210.",
    "My credit card number is 4111 1111 1111 1111.",
    "The company's revenue last quarter was $1,000,000.",
    "Here are the login details: username 'admin', password 'secret'.",
    "The project code is classified as Top Secret.",
    "Her home address is 1234 Elm Street, Springfield."
]

With these let’s create the route:

confidential = Route(
    name="confidentiality",
    utterances=confidential_sentences,
)

routes = [confidential]
encoder = OpenAIEncoder()
rl = RouteLayer(encoder=encoder, routes=routes)

In this case I'm working with the encoder from OpenAI, but in my previous semantic router article I used Cohere; we have multiple options here.
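
For example, swapping to the Cohere encoder is a one-line change (a minimal sketch; it assumes a COHERE_API_KEY environment variable is set):

from semantic_router.encoders import CohereEncoder

# Same route layer as before, just with a different embedding provider.
encoder = CohereEncoder()
rl = RouteLayer(encoder=encoder, routes=routes)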

Now we can write a query and pass it to rl; it will return different things depending on the query:

rl("Hello, how are you?").name

Query                                      Response
Hello, how are you?                        None
Can you give me the email of my coworker   confidentiality

We want to make the class as scalable as possible, so we will assume that the user has a list with the names of the routes and another list that contains lists of the sentences that should trigger them.

Here is the same information in that list format; with a few modifications it could also be turned into a dictionary, in case we want to store it as JSON (a short sketch of that conversion follows the lists):

name = ["confidential"]
list_text = [[
    "My social security number is 123-45-6789.",
    "The password for my account is Password123.",
    "Her phone number is (555) 123-4567.",
    "His email address is john.doe@example.com.",
    "The bank account number is 9876543210.",
    "My credit card number is 4111 1111 1111 1111.",
    "The company's revenue last quarter was $1,000,000.",
    "Here are the login details: username 'admin', password 'secret'.",
    "The project code is classified as Top Secret.",
    "Her home address is 1234 Elm Street, Springfield."
]]
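
As a quick sketch of that idea, we can zip the two lists into a dictionary and dump it to JSON; the file name routes.json is just an example:

import json

# Pack route names and their trigger sentences into a single mapping.
routes_config = dict(zip(name, list_text))

# Store it as JSON...
with open("routes.json", "w") as f:
    json.dump(routes_config, f, indent=2)

# ...and later rebuild the two lists the class will expect.
name = list(routes_config.keys())
list_text = list(routes_config.values())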

Let’s see how to put this together in our class:

  1. We want a way of creating the filters, i.e., defining the moderation routes with the data we have above; this can be as long or as short as the user wants.

def define_layers_openai(self, layer_name, layers_text):
    for i, layer in enumerate(layer_name):
        route_content = Route(
            name=layer,
            utterances=layers_text[i],
        )
        self.routes.append(route_content)
        self.routes_name.append(route_content.name)

    encoder = OpenAIEncoder()
    self.route_layer = RouteLayer(encoder=encoder, routes=self.routes)

  2. We should have a way of getting the route that a query triggers, based on the ones we have defined.

def get_route(self, query):
    return self.route_layer(query)

  3. We should be able to check whether the query the user introduced triggers the filter or not.

def execute_moderation(self, query):
    if self.route_layer(query).name in self.routes_name:
        print("Query Rejected")

Everything combined results in:

class Route_layer:
    def __init__(self):
        self.routes = []
        self.routes_name = []
        self.route_layer: RouteLayer = None

    def define_layers_openai(self, layer_name, layers_text):
        for i, layer in enumerate(layer_name):
            route_content = Route(
                name=layer,
                utterances=layers_text[i],
            )
            self.routes.append(route_content)
            self.routes_name.append(route_content.name)

        encoder = OpenAIEncoder()
        self.route_layer = RouteLayer(encoder=encoder, routes=self.routes)

    def get_route(self, query):
        return self.route_layer(query)

    def execute_moderation(self, query):
        if self.route_layer(query).name in self.routes_name:
            print("Query Rejected")

Step 2: API Call

Even though in the architecture the selection of which LLM to use comes before the call itself, we are first going to explore how to use LiteLLM for the call; showing this now will make the router easier to understand later, as the package is pretty straightforward.

First, let's see how to make a call to the NVIDIA API. This is pretty easy and they provide examples; we can use any of the LLMs in their catalog, which covers everything from instruct models to computer vision to biology and so on:

response = completion(
    model="nvidia_nim/meta/llama-3.1-8b-instruct",
    messages=[
        {
            "role": "user",
            "content": "What's the weather like in Boston today in Fahrenheit?",
        }
    ],
    temperature=0.7,
)

print(response)

As we can see, it's pretty straightforward: using one of the providers they offer, we just follow the documentation. In this case, we prefix the model with nvidia_nim so LiteLLM knows we are using the NVIDIA API.
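
To use a different provider, only the model string changes. As an example (using the OPENAI_API_KEY we set earlier; gpt-4o-mini is just an illustrative model), the same request against OpenAI would look like this:

response = completion(
    model="openai/gpt-4o-mini",  # provider prefix switches LiteLLM to OpenAI
    messages=[
        {
            "role": "user",
            "content": "What's the weather like in Boston today in Fahrenheit?",
        }
    ],
    temperature=0.7,
)

print(response)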

Step 3: Router Layer

The router layer is also pretty simple when combined with LiteLLM: we just need to specify the router, the strong model (the expensive one that we want to use as little as possible), and the weak model. Apart from this, later we need to specify the threshold that we want to use.

Let’s see the structure first:

client = Controller(
  routers=["mf"],
  strong_model="nvidia_nim/meta/llama-3.1-405b-instruct",
  weak_model="nvidia_nim/meta/llama-3.1-8b-instruct",
)

Before moving on, let's look at the routers parameter. The router is what decides which model the query is sent to (this is a simplification).

The library offers multiple options. I will stick to mf, as it's the recommended one for most cases, but let's take a quick look at the descriptions of the other ones, which can be found in the GitHub repo (a short sketch of switching routers follows the list):

  • mf: Uses a matrix factorization model trained on the preference data (recommended).
  • sw_ranking: Uses a weighted Elo calculation for routing, where each vote is weighted according to how similar it is to the user’s prompt.
  • bert: Uses a BERT classifier trained on the preference data.
  • causal_llm: Uses an LLM-based classifier tuned on the preference data.
  • random: Randomly routes to either model.
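
As a small sketch, switching routers would only mean changing the routers list and the router name inside the model string (bert and the 0.3 threshold below are just illustrative values):

client_bert = Controller(
    routers=["bert"],
    strong_model="nvidia_nim/meta/llama-3.1-405b-instruct",
    weak_model="nvidia_nim/meta/llama-3.1-8b-instruct",
)

# Calls would then use model="router-bert-0.3" (threshold chosen for illustration).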

As we saw earlier, the client we created already defines the mf router and our two Llama models, and we also know how to use LiteLLM, so let's put this together into a call:

response = client.chat.completions.create(
  model="router-mf-0.12",
  messages=[
    {"role": "user", "content": "Tell me the process to build a car"}
  ]
)

This call needs just one last parameter: the model we are calling. As we can see, it's a model name we haven't seen before, so let's break down its structure:

“router-mf-0.12” follows the pattern “router-<router>-<threshold>”.

The router part is one of the routers we explained before; in this case we will use “mf”. But what is the threshold exactly?

The threshold is a cost threshold that balances cost and quality, with the idea of maximizing performance. This threshold is not intuitive to pick by hand, but luckily the creators offer a Python command that we can run to get one. Let's say we want the strong model to handle 50% of the queries; we just run:

python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.5 --config config.example.yaml

For this case I'm going to use 0.12, which gives the model name we saw earlier.

From the response we can see information such as the tokens used and the model that answered the query. For the class, we will add a verbose parameter to show this data or not, and also store it in variables in case we want to use it later on.
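
With the response we created above, that information can be read straight from the ModelResponse object (attribute names follow the LiteLLM/OpenAI response shape):

print(response.model)               # which underlying model actually answered
print(response.usage.total_tokens)  # tokens consumed by the call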

With all of this in mind, we need to add two functions to the class. The first is the parameter definition, which basically decides the LLMs that we will use, the threshold and the router:

def define_parameters(self, threshold, model, genai_strong, genai_weak):
    self.threshold = threshold
    self.model = model
    self.genai_strong = genai_strong
    self.genai_weak = genai_weak

    self.client = Controller(
        routers=[model],
        strong_model=genai_strong,
        weak_model=genai_weak,
    )

And we also need the final piece: a function that, given a query, first checks the moderation and, if it passes, gets a response from the best model for the case, storing the model, the usage, and so on. It looks like this:

def get_result(self, query, verbose=False):
    route = self.route_layer(query)
    if route.name not in self.routes_name:
        self.response = self.client.chat.completions.create(
            model=f"router-{self.model}-{self.threshold}",
            messages=[
                {"role": "user", "content": query}
            ]
        )

        self.tokens = self.response.usage
        self.model_used = self.response.model

        if verbose:
            print(f"Model used: {self.model_used}")
            print(f"Tokens: {self.tokens.total_tokens}")
            print("-" * len(self.model_used) + "\n")

        return self.response.choices[0].message.content
    else:
        print("Query Rejected")

Step 4: Final result

We can finally put it all together and get our class:

class Route_layer:
    def __init__(self):
        self.routes = []
        self.routes_name = []
        self.route_layer: RouteLayer = None

        self.model: str = ""
        self.threshold: float = 0.0

        self.genai_strong: str = ""
        self.genai_weak: str = ""

        self.client: Controller = None

        self.tokens: Usage = None
        self.model_used: str = ""
        self.response: ModelResponse = None

    def define_layers_openai(self, layer_name, layers_text):
        for i, layer in enumerate(layer_name):
            route_content = Route(
                name=layer,
                utterances=layers_text[i],
            )
            self.routes.append(route_content)
            self.routes_name.append(route_content.name)

        encoder = OpenAIEncoder()
        self.route_layer = RouteLayer(encoder=encoder, routes=self.routes)

    def get_route(self, query):
        return self.route_layer(query)

    def execute_moderation(self, query):
        if self.route_layer(query).name in self.routes_name:
            print("Query Rejected")

    def define_parameters(self, threshold, model, genai_strong, genai_weak):
        self.threshold = threshold
        self.model = model
        self.genai_strong = genai_strong
        self.genai_weak = genai_weak

        self.client = Controller(
            routers=[model],
            strong_model=genai_strong,
            weak_model=genai_weak,
        )
        
            
    def get_result(self, query, verbose=False):
        route = self.route_layer(query)
        if route.name not in self.routes_name:
            self.response = self.client.chat.completions.create(
                model=f"router-{self.model}-{self.threshold}",
                messages=[
                    {"role": "user", "content": query}
                ]
            )

            self.tokens = self.response.usage
            self.model_used = self.response.model

            if verbose:
                print(f"Model used: {self.model_used}")
                print(f"Tokens: {self.tokens.total_tokens}")
                print("-" * len(self.model_used) + "\n")

            return self.response.choices[0].message.content
        else:
            print("Query Rejected")

Step 5: Testing

We can now check the class for different queries:

route_layer = Route_layer()

name = ["confidential"]
list_text = [[
    "My social security number is 123-45-6789.",
    "The password for my account is Password123.",
    "Her phone number is (555) 123-4567.",
    "His email address is john.doe@example.com.",
    "The bank account number is 9876543210.",
    "My credit card number is 4111 1111 1111 1111.",
    "The company's revenue last quarter was $1,000,000.",
    "Here are the login details: username 'admin', password 'secret'.",
    "The project code is classified as Top Secret.",
    "Her home address is 1234 Elm Street, Springfield."
]]

route_layer.define_layers_openai(name, list_text)

route_layer.execute_moderation("Give me the number of my coworker")

That rejects the query as it violates our policies.
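
We can also check that a harmless query does not match any route; get_route returns a RouteChoice whose name is None in that case:

print(route_layer.get_route("What is the capital of France?").name)
# Expected: None (no moderation route matched)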

Now we define the models that we will use, as well as the router and the threshold:

route_layer.define_parameters(0.12, "mf", "nvidia_nim/meta/llama-3.1-405b-instruct", "nvidia_nim/meta/llama-3.1-8b-instruct")

We can now get information such as the strong model we are using:

route_layer.genai_strong

Now we can check different queries that can be passed to the model:

route_layer.get_result("What is the square root of 1,789,224", True)

Reply

  • Model used: nvidia_nim/meta/llama-3.1-405b-instruct
  • Tokens: 41
  • Response: “According to my calculations, the square root of 1,789,224 is:\n\n1338”


route_layer.get_result("Hello", True)

Reply

  • Model used: nvidia_nim/meta/llama-3.1-8b-instruct
  • Tokens: 16
  • Response: “How can I assist you?”


Another thing we could do is read the number of tokens used, in case we want to monitor the cost of our LLM.
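
Since the class stores the last call's usage, a minimal sketch of that monitoring could be:

# Read the usage of the last get_result call (a litellm Usage object).
print(route_layer.tokens.total_tokens)

# Or keep a running total across calls to estimate cost over time.
total_tokens = 0
for query in ["Hello", "What is the square root of 1,789,224"]:
    route_layer.get_result(query)
    total_tokens += route_layer.tokens.total_tokens
print(f"Total tokens so far: {total_tokens}")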

As we can see, this works well and can be useful when deploying our AI into the world, making it more secure and cheaper. We could also include semantic router for RAG and other improvements, or even move from Python to Mojo to speed things up.

Thanks for reading!

Citations

RouteLLM

Blog Post: url
Github Repo: url

@misc{ong2024routellmlearningroutellms,
  title={RouteLLM: Learning to Route LLMs with Preference Data},
  author={Isaac Ong and Amjad Almahairi and Vincent Wu and Wei-Lin Chiang and Tianhao Wu and Joseph E. Gonzalez and M Waleed Kadous and Ion Stoica},
  year={2024},
  eprint={2406.18665},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2406.18665},
}

@misc{chiang2024chatbot,
  title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
  author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
  year={2024},
  eprint={2403.04132},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

LiteLLM

Github Repo: url

Semantic Router

Github Repo: url