Chatting with Crypto: Governance Proposals and Documentation

Learn how to scrape proposal and documentation data and cache embeddings

In this blog post, we'll start working on the foundation for our DAO AI voting system that I've previously described on this blog. We will:

  • Ingest proposal data from Snapshot
  • Ingest documentation from the projects' Gitbooks
  • And then use a Langchain RetrievalQA chain to chat with the available data

Much of the code is lifted from my previous AI chatbot projects, like the one that lets you chat with your Notion workspace. If you want a more beginner-friendly tutorial on chatting with your data, I recommend you check out these previous blog posts:

Here we will focus on showing how to create Langchain documents manually, how to scrape Gitbooks, and how to use the embedding cache feature to save on processing time and API costs.

Creating Langchain Documents from Snapshot

Snapshot is probably the most popular platform in the blockchain space for managing DAOs and governance proposals. It provides a GraphQL API from which we can get all the proposals like so:

import requests

def proposals_from_graphql(spacename):
    url = ""  # the Snapshot GraphQL endpoint

    # GraphQL query: fetch the 400 most recent closed proposals for a space
    query = """
    query Proposals($spaceName: String!) {
        proposals(
            first: 400,
            skip: 0,
            where: {
                space_in: [$spaceName],
                state: "closed"
            },
            orderBy: "created",
            orderDirection: desc
        ) {
            id
            title
            body
            author
            end
            choices
            scores
            space {
                id
                name
            }
        }
    }
    """

    # Define the query variables
    variables = {
        "spaceName": spacename
    }

    # Send the GraphQL query with variables
    response = requests.post(url, json={"query": query, "variables": variables})

    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()
        proposals_data = data["data"]["proposals"]

        # TODO: Transform the list of proposal dictionaries into an array of internal objects

We send the GraphQL query with a given space name, which identifies the DAO/organisation whose proposals we want to fetch (one example is "aave.eth"). We get back a list of proposals and then just have to store them in some way in our application. We also recommend caching the response somewhere so that you're not hitting Snapshot's GraphQL API all the time.
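A minimal sketch of such a cache (the directory name, TTL, and helper names are my own choices, not from the original code): persist the parsed JSON per space name and re-fetch only when the cached file has gone stale.

```python
import json
import os
import time

CACHE_DIR = "snapshot_cache"  # hypothetical cache directory
CACHE_TTL = 3600              # re-fetch after one hour

def cached_proposals(spacename, fetch_fn):
    """Return cached proposal data for a space, calling fetch_fn only on a miss."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{spacename}.json")

    # Serve from the cache while the file is fresh enough
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
        with open(path) as f:
            return json.load(f)

    # Cache miss: hit the GraphQL API and persist the result
    data = fetch_fn(spacename)
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

Here `fetch_fn` would be the `proposals_from_graphql` function from above; injecting it keeps the caching logic separate from the network call.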

Now, let's assume that we've created proposal objects that we need to convert to Langchain documents in order to continue with our LLM processing:

from langchain.docstore.document import Document

# ... Class definition ...

    def to_documents(self):
        documents   = []

        metadata    = {}
        metadata["doctype"] = "proposal"
        metadata["title"]   = self.title
        metadata["author"]  = self.author
        metadata["date"]    = self.end

        formatted_choices = [str(item) for item in self.choices]
        formatted_scores  = [str(item) for item in self.scores]

        metadata["choices"] = ", ".join(formatted_choices)
        metadata["scores"]  = ", ".join(formatted_scores)

        doc = Document(page_content = self.body,
                       metadata = metadata)
        documents.append(doc)

        return documents

Here we are:

  • Creating the Document with page_content set to the proposal body (this is the main text/proposal description).
  • Adding some metadata fields to the document, like the title, author and date.
  • Note that the Document metadata can't contain lists or maps. So for metadata that was stored as a list (in our case the voting choices and voting scores), we simply join the items into a string.
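Because the lists are flattened into strings on the way in, they have to be split again whenever we read the metadata back out of the vectorstore. A small sketch of the reverse step (the helper name is mine, and it assumes no choice text itself contains ", "):

```python
def unflatten_metadata(metadata):
    """Recover the choices/scores lists that were joined into strings."""
    choices = metadata["choices"].split(", ")
    scores = [float(s) for s in metadata["scores"].split(", ")]
    return choices, scores

# Example: metadata shaped like the output of to_documents above
meta = {"choices": "For, Against, Abstain", "scores": "1523.4, 201.0, 15.7"}
choices, scores = unflatten_metadata(meta)
```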

And that's it. We now have created Langchain documents from Snapshot data that we can use for creating text chunks and embeddings. Before we do that, let's quickly also get the documentation data.

Scraping and Parsing Gitbooks

This step is ridiculously easy: Langchain provides a document loader to which we can simply pass the documentation URL, and it will scrape all the different pages in the Gitbook and give us the Langchain documents:

from langchain.document_loaders import GitbookLoader

def gitbook_to_documents(url):
    loader = GitbookLoader(url, load_all_paths=True)
    docs   = loader.load()
    for doc in docs:
        doc.metadata["doctype"] = "documentation"

    return docs

To differentiate the documentation data from the proposal data from the first step, we are adding a doctype metadata field.
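The doctype field also lets us slice a combined corpus by source later on. A minimal sketch, with plain dicts standing in for Langchain documents so it runs without any dependencies (the helper name is mine):

```python
def split_by_doctype(documents):
    """Partition a mixed document list into proposals and documentation."""
    proposals = [d for d in documents if d["metadata"]["doctype"] == "proposal"]
    docs = [d for d in documents if d["metadata"]["doctype"] == "documentation"]
    return proposals, docs

# Example: a combined corpus from both ingestion steps
corpus = [
    {"page_content": "Raise the borrow cap", "metadata": {"doctype": "proposal"}},
    {"page_content": "How liquidations work", "metadata": {"doctype": "documentation"}},
]
proposals, docs = split_by_doctype(corpus)
```

The same field can also serve as a metadata filter at query time, so the retriever only searches one source.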

Creating the Embeddings

Now that we have all the documents we can create the embeddings:

  1. Split the documents into text chunks.
  2. Initialise the cached embeddings object – this will save us a lot of processing time and API costs!
  3. Create the ChromaDB vectorstorage.
  4. Create the retriever.
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.storage import LocalFileStore
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings

def documents_to_retriever(documents):
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=globals.CHUNK_SIZE, chunk_overlap=globals.CHUNK_OVERLAP)
    texts         = text_splitter.split_documents(documents)

    embeddings      = OpenAIEmbeddings()
    fs              = LocalFileStore(globals.EMBEDDING_CACHE_PATH)
    cached_embedder = CacheBackedEmbeddings.from_bytes_store(embeddings, fs, namespace=embeddings.model)

    db = Chroma.from_documents(texts, cached_embedder)

    if len(documents) >= 4:
        retriever = db.as_retriever(search_kwargs={"k": 4})
    else:
        retriever = db.as_retriever(search_kwargs={"k": 1})

    return retriever
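The idea behind CacheBackedEmbeddings can be illustrated with a stripped-down stand-in: key each text by a hash (prefixed with a namespace such as the model name), persist computed vectors to disk, and call the expensive embedder only on cache misses. This is a sketch of the concept, not Langchain's actual internals:

```python
import hashlib
import json
import os

class SimpleEmbeddingCache:
    """Toy file-backed embedding cache, illustrating the CacheBackedEmbeddings idea."""

    def __init__(self, embed_fn, cache_dir, namespace=""):
        self.embed_fn = embed_fn      # the expensive embedding call
        self.cache_dir = cache_dir
        self.namespace = namespace    # e.g. the embedding model name
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, text):
        # Namespacing the key avoids collisions between different models
        key = hashlib.sha1((self.namespace + text).encode()).hexdigest()
        return os.path.join(self.cache_dir, key + ".json")

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            path = self._path(text)
            if os.path.exists(path):
                # Cache hit: load the stored vector, no API call
                with open(path) as f:
                    vectors.append(json.load(f))
            else:
                # Cache miss: embed and persist for next time
                vec = self.embed_fn(text)
                with open(path, "w") as f:
                    json.dump(vec, f)
                vectors.append(vec)
        return vectors
```

Since the same text chunks come up again on every re-ingestion run, the second and later runs are almost entirely cache hits, which is where the time and cost savings come from.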

And now that we have the retriever, we can - as usual - simply initialise our RetrievalQA chain and run the query with our templated prompt:

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def query_retriever(retriever, query):
    prompt_text = """
    You are a blockchain and economics expert. Blockchain DAOs have governance proposals. You are being asked a question.
    First, here's the context:

    {context}

    Second, here are some more instructions: If in doubt, always use the most recent proposals by date. And keep the response below 15000 characters and end on a full sentence.

    And finally here is the question of the user: {question}
    """

    prompt = PromptTemplate(template=prompt_text, input_variables=["context", "question"])
    chain_type_kwargs = {"prompt": prompt}

    qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name="gpt-3.5-turbo", openai_api_key=globals.OPENAI_API_KEY_QUERY), chain_type="stuff", retriever=retriever,
                                     return_source_documents = True, chain_type_kwargs=chain_type_kwargs, verbose=True)
    result = qa({'query': query})
    return result
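For intuition, the "stuff" chain essentially concatenates the retrieved chunks into the {context} slot and the user's query into {question}. Plain string formatting shows the shape of what the LLM ultimately receives (template abbreviated, sample data mine):

```python
prompt_text = (
    "You are a blockchain and economics expert.\n"
    "First, here's the context:\n\n{context}\n\n"
    "And finally here is the question of the user: {question}\n"
)

# What the "stuff" chain effectively does with the retrieved documents
retrieved_chunks = [
    "Proposal A: raise the borrow cap.",
    "Proposal B: add a new collateral asset.",
]
filled = prompt_text.format(
    context="\n\n".join(retrieved_chunks),
    question="Which proposal is more recent?",
)
```

This is also why the retriever's k matters: every retrieved chunk is stuffed into the prompt, so k times the chunk size has to stay within the model's context window.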

Next Steps: Testing AI on AAVE Documentation

Next up, we'll evaluate how well our chat AI performs by testing it on the documentation and proposal data of AAVE. Ensuring that our AI has a good understanding of this data is the prerequisite to letting it evaluate and vote on governance proposals.
