Chatting with Crypto: Governance Proposals and Documentation
Learn how to scrape proposal and documentation data and cache embeddings
In this blog post, we'll start working on the foundation for our DAO AI voting system that I've previously described on this blog. We will:
- Ingest proposal data from Snapshot
- Ingest documentation from the project's GitBook
- Use a LangChain RetrievalQA chain to chat with the available data
Much of the code is lifted from my previous AI chatbot projects, like the one that lets you chat with your Notion workspace. If you want a more beginner-friendly tutorial on chatting with your data, I recommend you check out these previous blog posts:
Here we will focus on showing how to create LangChain documents manually, how to scrape GitBooks, and how to use the embedding cache feature to save on processing time and API costs.
Creating LangChain Documents from Snapshot
Snapshot is probably the most popular platform in the blockchain space for managing DAOs and governance proposals. It provides a GraphQL API from which we can get all the proposals like so:
import requests

def proposals_from_graphql(spacename):
    url = "https://hub.snapshot.org/graphql"
    # GraphQL query: the latest 400 closed proposals of a space, newest first
    query = """
    query Proposals($spaceName: String!) {
        proposals(
            first: 400,
            skip: 0,
            where: {
                space_in: [$spaceName],
                state: "closed"
            },
            orderBy: "created",
            orderDirection: desc
        ) {
            id
            title
            body
            choices
            scores
            start
            end
            snapshot
            state
            author
            space {
                id
                name
            }
        }
    }
    """
    # Define the variables
    variables = {
        "spaceName": spacename
    }
    # Send the GraphQL query with variables
    response = requests.post(url, json={"query": query, "variables": variables})
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        proposals_data = data["data"]["proposals"]
        # TODO: Transform the list of proposal dictionaries into an array of internal objects
        return proposals_data
    response.raise_for_status()
We send the GraphQL query with a given space name - this just identifies the DAO/organisation whose proposals we want to fetch (one example is "aave.eth"). We get a list of proposals back and then just have to store them in some way in our application. I also recommend caching this data somewhere so that you're not hitting Snapshot's GraphQL API all the time.
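To make the rest of the post concrete, here is a minimal sketch of what such an internal object could look like. The Proposal class and its field names are just my illustration, not a fixed API - any structure that keeps the title, body, author, choices, scores and end date around will do (the to_documents method we'll write next would live on this class):

class Proposal:
    """Internal representation of a Snapshot proposal (illustrative sketch)."""
    def __init__(self, data):
        # data is a single proposal dictionary from the GraphQL response
        self.title = data["title"]
        self.body = data["body"]
        self.author = data["author"]
        self.end = data["end"]
        self.choices = data["choices"]
        self.scores = data["scores"]

proposals = [Proposal(p) for p in proposals_from_graphql("aave.eth")]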
Now that we have proposal objects, we need to convert them to LangChain documents in order to continue with our LLM processing:
from langchain.docstore.document import Document

# ... Class definition ...

    def to_documents(self):
        documents = []
        metadata = {}
        metadata["doctype"] = "proposal"
        metadata["title"] = self.title
        metadata["author"] = self.author
        metadata["date"] = self.end
        # Metadata values must be scalars, so join the list fields into strings
        formatted_choices = [str(item) for item in self.choices]
        formatted_scores = [str(item) for item in self.scores]
        metadata["choices"] = ", ".join(formatted_choices)
        metadata["scores"] = ", ".join(formatted_scores)
        doc = Document(page_content=self.body, metadata=metadata)
        documents.append(doc)
        return documents
Here we are:
- Creating the Document with page_content mapped to the proposal body (this is the main text/proposal description).
- Adding some metadata fields to the document, like the title, author and date.
- Note that the Document metadata can't contain lists or maps (vector stores like Chroma only accept scalar values). So for metadata that was saved as a list (in our case the voting choices and voting scores), we are simply joining the items together.
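Assuming we hold a list of Proposal objects (per the sketch above), collecting all of their documents is a short comprehension:

proposal_docs = [doc for proposal in proposals for doc in proposal.to_documents()]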
And that's it. We have now created LangChain documents from Snapshot data that we can use for creating text chunks and embeddings. Before we do that, let's quickly also get the documentation data.
Scraping and Parsing GitBooks
This step is ridiculously easy: LangChain provides a document loader to which we simply give the documentation URL, and it will scrape all the different pages in the GitBook and give us the LangChain documents:
from langchain.document_loaders import GitbookLoader

def gitbook_to_documents(url):
    loader = GitbookLoader(url, load_all_paths=True)
    docs = loader.load()
    # Tag each page so we can tell documentation apart from proposals
    for doc in docs:
        doc.metadata["doctype"] = "documentation"
    return docs
To differentiate the documentation from the proposal data of the first step, we again add a doctype metadata field.
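Both sources can now be merged into a single corpus. The URL below is a hypothetical placeholder - substitute the actual GitBook address of the project you're targeting:

# Hypothetical GitBook URL - replace with your project's documentation
all_documents = proposal_docs + gitbook_to_documents("https://docs.example-dao.com")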
Creating the Embeddings
Now that we have all the documents, we can create the embeddings:
- Split the documents into text chunks.
- Initialise the cached embeddings object – this will save us a lot of processing time and API costs!
- Create the ChromaDB vectorstorage.
- Create the retriever.
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from langchain.llms import OpenAI
from langchain.storage import LocalFileStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

def documents_to_retriever(documents):
    # 1. Split the documents into text chunks
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=globals.CHUNK_SIZE, chunk_overlap=globals.CHUNK_OVERLAP)
    texts = text_splitter.split_documents(documents)
    # 2. Cache embeddings on disk, namespaced by the embedding model name
    embeddings = OpenAIEmbeddings()
    fs = LocalFileStore(globals.EMBEDDING_CACHE_PATH)
    cached_embedder = CacheBackedEmbeddings.from_bytes_store(
        embeddings, fs, namespace=embeddings.model)
    # 3. Create the ChromaDB vector storage
    db = Chroma.from_documents(texts, cached_embedder)
    # 4. Create the retriever, capping k at the number of available chunks
    k = min(4, len(texts))
    retriever = db.as_retriever(search_kwargs={"k": k})
    return retriever
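The cache is what makes repeated runs cheap: CacheBackedEmbeddings hashes each text chunk under the given namespace (here the embedding model name) and only calls the OpenAI API for chunks it hasn't embedded before. A rough way to see the effect, as a sketch:

import time

start = time.time()
retriever = documents_to_retriever(all_documents)  # cold run: every chunk is embedded via the API
print(f"cold run: {time.time() - start:.1f}s")

start = time.time()
retriever = documents_to_retriever(all_documents)  # warm run: embeddings come from the local file cache
print(f"warm run: {time.time() - start:.1f}s")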
And now that we have the retriever, we can - as usual - simply initialise our RetrievalQA chain and run the query with our templated prompt:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def query_retriever(retriever, query):
    prompt_text = """
    You are a blockchain and economics expert. Blockchain DAOs have governance proposals. You are being asked a question.
    First, here's the context:
    {context}
    Second, here are some more instructions: If in doubt, always use the most recent proposals by date. And keep the response below 15000 characters and end on a full sentence.
    And finally here is the question of the user: {question}
    """
    prompt = PromptTemplate(template=prompt_text, input_variables=["context", "question"])
    chain_type_kwargs = {"prompt": prompt}
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(model_name="gpt-3.5-turbo", openai_api_key=globals.OPENAI_API_KEY_QUERY),
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs=chain_type_kwargs,
        verbose=True)
    result = qa({"query": query})
    return result
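With the retriever from the previous step, querying is then a single call. The result is a dictionary holding the answer under "result" and - since we set return_source_documents=True - the retrieved chunks under "source_documents". The question below is just an example:

result = query_retriever(retriever, "What did the most recent proposal decide?")
print(result["result"])
# Inspect which chunks the answer was grounded in
for doc in result["source_documents"]:
    print(doc.metadata.get("title", doc.metadata["doctype"]))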
Next Steps: Testing AI on AAVE Documentation
Next up, we'll evaluate how well our chat AI performs by testing it on the documentation and proposal data of AAVE. Ensuring that the AI has a good understanding of this material is the prerequisite to letting it evaluate and vote on governance proposals.