Testing Our AI's AAVE Grasp: Exploring Evaluators and LLM Testing

We will create a QA and ranking evaluator to score our AI's performance.

Welcome back to our ongoing mission of crafting an AI voting assistant for blockchain DAOs. Last week, we loaded up AAVE's documentation and governance proposals, created embeddings and ran our first queries. But before we plunge headfirst into AI-driven voting, there's an essential checkpoint: ensuring our AI actually understands AAVE.

Digging into AI Testing with Custom Evaluators

While it might seem tempting to measure the AI's understanding by calculating embedding distances, we're opting for a more hands-on approach. Word (or embedding) distance can be misleading: the distance between "CVE for Windows 10" and "CVE for Windows 11" is quite small, for example, even though the two refer to entirely different sets of vulnerabilities. Instead, we will use two custom end-to-end evaluators for our test. Think of them as our AI's coaches, giving it quizzes to see if it's genuinely getting the hang of AAVE.
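
To illustrate the problem, here is a minimal sketch (not part of our evaluator code) that compares the two example strings in embedding space. It assumes the same OpenAI embeddings and API key setup we used last week:

import numpy as np
import globals
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=globals.OPENAI_API_KEY_QUERY)

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_10, vec_11 = embeddings.embed_documents(["CVE for Windows 10", "CVE for Windows 11"])
# The two strings typically land very close together in embedding space,
# even though they describe different sets of vulnerabilities.
print(cosine_similarity(vec_10, vec_11))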

Taking a Look at Evaluator Basics

Let's start off with how our Evaluator base class works. Think of it as giving the AI multiple tries at a test and then calculating the average score, which gives us a realistic sense of how well it's doing.

class Evaluator:
    def __init__(self, evaluation_type):
        self.evaluation_type    = evaluation_type
        self.evaluation_scores  = []

    def get_average_evaluation_score(self):
        return sum(self.evaluation_scores) / len(self.evaluation_scores)

    def evaluate(self, num_runs, **kwargs):
        # Run the same evaluation several times to smooth out the LLM's randomness.
        for _ in range(num_runs):
            score = self.evaluate_single_run(**kwargs)
            if score is None:
                print("Error while handling query!")
                score = 0.0

            self.evaluation_scores.append(score)

    def evaluate_single_run(self, **kwargs):
        # Overridden by the concrete evaluators below; returns a score between 0.0 and 1.0.
        pass

Further down in the article, we'll derive two evaluators from this base class. We'll add them to a simple EvaluatorCollection that gathers up all the results and calculates an average score:

class EvaluatorCollection:
    def __init__(self):
        self.evaluators = []

    def add_evaluator(self, evaluator):
        self.evaluators.append(evaluator)

    def run_evaluators(self, **kwargs):
        # Give every evaluator 5 runs; the kwargs (e.g. the retriever) are passed through.
        for evaluator in self.evaluators:
            evaluator.evaluate(5, **kwargs)

    def __str__(self):
        ret = "Evaluation:"
        for evaluator in self.evaluators:
            ret += ("\n" + evaluator.evaluation_type + ": " + str(evaluator.get_average_evaluation_score()))

        ret += ("\noverall: " + str(self.get_average_evaluation_scores()))
        return ret

    def get_average_evaluation_scores(self):
        # The overall score is the average of the per-evaluator averages.
        overall_score = 0.0
        for evaluator in self.evaluators:
            overall_score += evaluator.get_average_evaluation_score()

        return overall_score / len(self.evaluators)

Evaluator Base Code
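
To tie these pieces together, here is a rough sketch of how the collection will be driven once the concrete evaluators below are defined. It assumes vectorstore is the vector store we built from AAVE's documentation and proposals last week:

ec = EvaluatorCollection()

# EvaluatorQA is defined further down in the article; any evaluator derived from
# Evaluator can be registered here.
ec.add_evaluator(EvaluatorQA("What was the reason for offboarding BUSD?",
                             "Paxos stopped minting BUSD due to recent developments with the SEC",
                             ["BUSD depegged from the USD",
                              "It was unclear whether there were enough BUSD reserves",
                              "Because AAVE was planning to release their own stablecoin"]))

# Each evaluator pulls the retriever out of **kwargs and builds its own RetrievalQA chain,
# so we only need to hand the retriever in once.
ec.run_evaluators(retriever=vectorstore.as_retriever())
print(ec)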

Ranking Evaluator: Sorting Things Out

Picture the ranking evaluator as a sorter: it arranges answers in the right order. We hand it a question, the model's answer, a verified reference answer, a wrong answer, and a completely off-topic answer. The evaluator then asks an LLM to rank the three candidate answers against the reference; the closer the model's answer lands to the top (and the off-topic answer to the bottom), the higher the score.

import globals
from evaluators.evaluator_base import Evaluator
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

class EvaluatorRanking(Evaluator):
    def __init__(self, query, 
                 reference_answer,
                 reference_false_answer,
                 reference_irrelevant_answer,
                 model_response):
        super().__init__("ranking")
        self.query                       = query
        self.reference_answer            = reference_answer
        self.reference_false_answer      = reference_false_answer
        self.reference_irrelevant_answer = reference_irrelevant_answer
        self.model_response              = model_response

    def evaluate_single_run(self, **kwargs):
        retriever = kwargs["retriever"]

        prompt_part2 = """
        Here is the reference answer that has been verified as correct by a human:
        {reference_answer}

        Here are 3 alternative answers provided by LLMs:
        A. {reference_false_answer}
        B. {model_response}
        C. {reference_irrelevant_answer}

        Rank these three alternative answers from best to worst by comparing them to the reference answer.
        Only answer with the letters in order from best to worst. An example for the output format I want would be 'A, B, C'.
        """

        # Scores reward rankings that put the model's answer (B) first and the
        # irrelevant answer (C) last.
        response_scores = {
            "B, A, C": 1.00,
            "B, C, A": 0.75,
            "A, B, C": 0.50,
            "C, B, A": 0.10,
            "A, C, B": 0.00,
            "C, A, B": 0.00
        }

        prompt_part2 = prompt_part2.format(reference_answer       = self.reference_answer,
                                           reference_false_answer = self.reference_false_answer,
                                           model_response         = self.model_response,
                                           reference_irrelevant_answer = self.reference_irrelevant_answer)
        # Note: This will correctly use the query only, not the whole template for document retrieval
        prompt_text = """
        First, here is some relevant context from our knowledge base:
        {context}
        I provided a LLM with a knowledge base to answer the following question:
        {question}
        """ + prompt_part2

        prompt = PromptTemplate(template=prompt_text, input_variables=["context", "question"])
        chain_type_kwargs = {"prompt": prompt}

        qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name="gpt-3.5-turbo", openai_api_key=globals.OPENAI_API_KEY_QUERY), chain_type="stuff", retriever=retriever,
                                        return_source_documents = False, chain_type_kwargs=chain_type_kwargs, verbose=True)
        result = qa({'query': self.query})["result"].strip()

        if result in response_scores:
            return response_scores[result]

        return None

Ranking Evaluator Code

QA Evaluator: Picking the Correct Path

In the realm of AI, picking the right answer is the name of the game. Our QA (Question-Answer) evaluator embodies this skill. Armed with a question, a right answer, and a bunch of wrong answers, the AI's challenge is to pinpoint the right response. The outcome? A simple score: 1 for a win, 0 for a miss.

import globals
from evaluators.evaluator_base import Evaluator
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

class EvaluatorQA(Evaluator):
    def __init__(self, query, correct_answer, incorrect_answers):
        super().__init__("qa")
        self.query              = query
        self.correct_answer     = correct_answer
        self.incorrect_answers  = incorrect_answers

    def evaluate_single_run(self, **kwargs):
        retriever = kwargs["retriever"]

        prompt_part2 = """
        Answers:
        {answers}

        Please choose the correct answer by providing the answer letter only.
        As an example of the output format, the answer you provide should look like this: 'A'.
        """

        # The correct answer is always appended last, then all answers are labelled A, B, C, ...
        answers = self.incorrect_answers + [self.correct_answer]
        answers = [chr(ord('A') + i) + ". " + answer for i, answer in enumerate(answers)]
        prompt_part2 = prompt_part2.format(answers = "\n".join(answers))

        # Note: This will correctly use the query only, not the whole template for document retrieval
        prompt_text = """
        First, here is some relevant context from our knowledge base:
        {context}
        Here is a question and a few possible answers:
        Question:
        {question}
        """ + prompt_part2

        prompt = PromptTemplate(template=prompt_text, input_variables=["context", "question"])
        chain_type_kwargs = {"prompt": prompt}

        qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name="gpt-3.5-turbo", 
                                                    openai_api_key=globals.OPENAI_API_KEY_QUERY), 
                                        verbose=True,
                                        chain_type="stuff", retriever=retriever,
                                        return_source_documents = False, 
                                        chain_type_kwargs=chain_type_kwargs)
        result = qa({'query': self.query})["result"].strip()

        # The correct answer was appended last, so its letter follows the incorrect ones
        # (with three incorrect answers, the correct letter is 'D').
        correct_letter = chr(ord('A') + len(self.incorrect_answers))
        if result == correct_letter:
            return 1.0
        else:
            return 0.0

QA Evaluator Code

Putting AI to the Test

Enough theory—let's roll up our sleeves and see how our AI tackles these tests. Our trial includes:

  • 5 ranking questions and 5 QA questions about AAVE's documentation (e.g. "How does GHO contribute to the DAO treasury?")
  • 3 ranking questions and 3 QA questions taken from real governance proposals (e.g. "Why was a flash minter approved at GHO mainnet launch?")

At the moment, our AI reaches an average score of 0.81. That is decent, but it is not good enough as a base for an AI voting tool.

    add_ranking_evaluator(ec,
                          "How does GHO contribute to the DAO treasury?",
                          "GHO can be borrowed after supplying collateral. Any interest payments go to the DAO treasury.",
                          "There is a tax on all GHO mint transactions. This tax goes to the DAO treasury.",
                          "By flying tourists to the moon")

    add_ranking_evaluator(ec,
                          "Why was a flash minter approved at GHO mainnet launch?",
                          "Because it would be a beneficial influence on GHO's ability to maintain it's peg", 
                          "To allow for easier market manipulation through flash loans",
                          "For the lolz")

    add_qa_evaluator(ec,
                     "What was the reason for offboarding BUSD?",
                     "Paxos stopped minting BUSD due to recent developments with the SEC",
                     ["BUSD depegged from the USD",
                     "It was unclear whether there were enough BUSD reserves",
                     "Because AAVE was planning to release their own stablecoin"])

Some example questions added to our evaluators
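
The add_ranking_evaluator and add_qa_evaluator helpers simply wrap the constructors shown earlier. Their exact implementation isn't shown here, but a minimal sketch might look like this (query_model is a hypothetical helper that asks our retrieval chain the plain question and returns the model's answer as a string):

def add_qa_evaluator(ec, query, correct_answer, incorrect_answers):
    ec.add_evaluator(EvaluatorQA(query, correct_answer, incorrect_answers))

def add_ranking_evaluator(ec, query, reference_answer,
                          reference_false_answer, reference_irrelevant_answer):
    # The ranking evaluator needs the model's own answer, so we fetch it first.
    # query_model() is a hypothetical wrapper around the same RetrievalQA setup used above.
    model_response = query_model(query)
    ec.add_evaluator(EvaluatorRanking(query, reference_answer,
                                      reference_false_answer,
                                      reference_irrelevant_answer,
                                      model_response))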

What's Next

Before we continue building our AI voting tool, we need to improve the AI's understanding of the AAVE DAO. Our aim for now is to raise the evaluation score to at least 0.9 (in a later iteration, we will aim even higher and test on a larger question set). We will get there through prompt engineering and by refining the document retrieval process.
