Testing Our AI's Grasp of AAVE: Exploring Evaluators and LLM Testing
We will create QA and ranking evaluators to score our AI's performance.
Welcome back to our ongoing mission of crafting an AI voting assistant for blockchain DAOs. Last week, we loaded up AAVE's documentation and governance proposals, created embeddings and ran our first queries. But before we plunge headfirst into AI-driven voting, there's an essential checkpoint: ensuring our AI actually understands AAVE.
Digging into AI Testing with Custom Evaluators
While it might seem tempting to measure the AI's understanding by calculating word distances, we're opting for a more hands-on approach. Word (or embedding) distance can be misleading: the distance between "CVE for Windows 10" and "CVE for Windows 11" is quite low, yet the gap between the two is substantial. Instead, we will use two custom end-to-end evaluators for our test. These evaluators are like our AI's coaches, giving it quizzes to see if it's genuinely getting the hang of AAVE.
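To make that concrete, here's a quick sketch of how close those two strings land in embedding space. It assumes the sentence-transformers package; any general-purpose embedding model shows the same effect:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["CVE for Windows 10", "CVE for Windows 11"])

# Cosine similarity comes out close to 1.0, even though the two strings
# refer to entirely different sets of vulnerabilities.
print(util.cos_sim(embeddings[0], embeddings[1]))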
Taking a Look at Evaluator Basics
Let's start with how our Evaluator base class works. Think of it as giving the AI multiple tries at a test and then averaging the scores, which gives us a real sense of how well it's doing:
class Evaluator:
    """Base class: runs a test several times and averages the scores."""

    def __init__(self, evaluation_type):
        self.evaluation_type = evaluation_type
        self.evaluation_scores = []

    def evaluate(self, num_runs, **kwargs):
        for _ in range(num_runs):
            score = self.evaluate_single_run(**kwargs)
            if score is None:
                print("Error while handling query!")
                score = 0.0
            self.evaluation_scores.append(score)
        return sum(self.evaluation_scores) / len(self.evaluation_scores)

    def evaluate_single_run(self, **kwargs):
        # Each concrete evaluator implements a single test run.
        raise NotImplementedError
Further down in the article, we'll derive two evaluators from this base class. We'll add them to a simple EvaluatorCollection that gathers up all the results and calculates an average score:
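The collection itself can stay minimal. Here's a sketch of what it could look like (the exact implementation is illustrative: it just needs to pool the scores every evaluator has recorded):

class EvaluatorCollection:
    """Pools the scores recorded by a set of evaluators."""

    def __init__(self, evaluators):
        self.evaluators = evaluators

    def average_score(self):
        # Flatten every evaluator's recorded scores into one list and average it.
        scores = [s for e in self.evaluators for s in e.evaluation_scores]
        return sum(scores) / len(scores)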
Ranking Evaluator: Sorting Things Out
Picture the ranking evaluator as a sorter, arranging answers in the right order. We throw it a question, the model's answer, the right answer, a wrong answer, and a completely off-topic answer. The evaluator's task? To arrange these answers so that the model's answer ends up on top and the random answer at the bottom.
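Here's a minimal sketch of how such an evaluator could derive from the base class. The judge callable is a stand-in for whatever LLM call produces the ranking; it isn't part of the original code:

import random

class RankingEvaluator(Evaluator):
    """Checks that a ranking puts the model's answer first and the off-topic answer last."""

    def __init__(self, judge):
        super().__init__("ranking")
        self.judge = judge  # stand-in callable: (question, answers) -> answers in ranked order

    def evaluate_single_run(self, question, model_answer, right_answer,
                            wrong_answer, random_answer):
        answers = [model_answer, right_answer, wrong_answer, random_answer]
        random.shuffle(answers)  # avoid positional bias in the judge
        try:
            ranking = self.judge(question, answers)
        except Exception:
            return None  # the base class logs this and scores the run 0.0
        # Half credit for putting the model's answer on top,
        # half for putting the off-topic answer at the bottom.
        score = 0.0
        if ranking[0] == model_answer:
            score += 0.5
        if ranking[-1] == random_answer:
            score += 0.5
        return score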
QA Evaluator: Picking the Correct Path
In the realm of AI, picking the right answer is the name of the game. Our QA (Question-Answer) evaluator embodies this skill. Armed with a question, a right answer, and a bunch of wrong answers, the AI's challenge is to pinpoint the right response. The outcome? A simple score: 1 for a win, 0 for a miss.
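Sketched against the same base class (again, the answer_picker callable standing in for the model under test is our own assumption):

import random

class QAEvaluator(Evaluator):
    """Scores 1.0 if the model picks the right answer out of the lineup, 0.0 otherwise."""

    def __init__(self, answer_picker):
        super().__init__("qa")
        self.answer_picker = answer_picker  # stand-in callable: (question, options) -> chosen option

    def evaluate_single_run(self, question, right_answer, wrong_answers):
        options = [right_answer] + list(wrong_answers)
        random.shuffle(options)  # don't always present the right answer first
        try:
            chosen = self.answer_picker(question, options)
        except Exception:
            return None
        return 1.0 if chosen == right_answer else 0.0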
Putting AI to the Test
Enough theory—let's roll up our sleeves and see how our AI tackles these tests. Our trial includes:
5 ranking questions and 5 QA questions about AAVE's documentation (e.g., "How does GHO contribute to the DAO treasury?")
3 ranking questions and 3 QA questions taken from real governance proposals (e.g., "Why was a flash minter approved at GHO mainnet launch?")
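Wired together, a test run could look roughly like this (the question lists and the judge/answer_picker callables are placeholders for our actual AAVE data and LLM calls):

ranking_evaluator = RankingEvaluator(judge)
qa_evaluator = QAEvaluator(answer_picker)

# Each question is a dict matching the evaluator's evaluate_single_run signature.
for q in ranking_questions:
    ranking_evaluator.evaluate(num_runs=3, **q)
for q in qa_questions:
    qa_evaluator.evaluate(num_runs=3, **q)

collection = EvaluatorCollection([ranking_evaluator, qa_evaluator])
print(f"Average score: {collection.average_score():.2f}")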
At the moment, our AI scores an average of 0.81. That is decent, but not good enough as a foundation for an AI voting tool.
Before we continue building our AI voting tool, we need to improve the AI's understanding of the AAVE DAO. Our aim is to raise the evaluation score to at least 0.9 for now (in a later iteration, we will push it even higher and test on a larger question set). We will get there through prompt engineering and by fine-tuning the document retrieval process.