Building with ChatGPT - Part 4 - Pinecone & Streamlit

Integrating Pinecone and Streamlit to create a web AI app.


In this part, we'll pick up where we left off last time by extending the small script (that let us ask YouTube channels anything) with a quick UI and by moving the database to another provider. If you haven't caught up yet, check out part 3 first, since we will re-use many of the parts we built with the Langchain library ⬇️

Building with ChatGPT - Part 3 - Using Langchain
Using Langchain and ChatGPT to analyse and query YouTube channels.

The AI Stack

The script we built in the previous blog post was a simple Python app that stored its data locally in a Chroma vector store (accessed via Langchain).

With the changes we are making now, we want to make the app available to multiple users. To do this, we are making the following big changes:

  1. Moving from local Chroma to Cloud Pinecone for data storage
  2. Adding a web interface via Streamlit

Moving from Chroma to Pinecone

We used a local Chroma instance to store the transcripts from the YouTube channels. While it is possible to host Chroma on a server and connect to it remotely, Pinecone provides an easier "cloud-by-default" option.

Because we're using Langchain, most of the dirty work has been abstracted away for us. We still start off by using the RecursiveCharacterTextSplitter to create the chunks, and then:

  1. Initialise Pinecone with the API key from their homepage
  2. Create a new index in the Pinecone environment
  3. Load the chunks into the index (this may take some time)

def initialise_index(channel_name, documents):
    print("Initialising index")
    # Split the transcripts into overlapping chunks, sized in tiktoken tokens
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=globals.CHUNK_SIZE, chunk_overlap=globals.CHUNK_OVERLAP
    )
    texts = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings(openai_api_key=globals.OPENAI_API_KEY_EMBED)

    # Connect to Pinecone and create a fresh index for this channel
    pinecone.init(api_key=globals.PINECONE_API_KEY, environment=globals.PINECONE_ENV)
    # 1536 matches the dimensionality of OpenAI's embedding vectors
    pinecone.create_index(name=channel_name.lower(), dimension=1536)
    index = Pinecone.from_documents(texts, embeddings, index_name=channel_name.lower())

    return index

The above code snippet creates the vector store and fills it with data. When we want to query it, we connect to Pinecone again with the API key and load the index we created before:

def load_index(channel_name):
    channel_index_name = channel_name.lower()
    # Reconnect to Pinecone and wrap the existing index
    pinecone.init(api_key=globals.PINECONE_API_KEY, environment=globals.PINECONE_ENV)
    embeddings = OpenAIEmbeddings(openai_api_key=globals.OPENAI_API_KEY_EMBED)
    index = Pinecone.from_existing_index(channel_index_name, embeddings)
    return index

def query_index(index, query):
    # "stuff" simply stuffs all retrieved chunks into a single prompt
    retriever = index.as_retriever()
    qa = RetrievalQA.from_chain_type(
        llm=OpenAI(openai_api_key=globals.OPENAI_API_KEY_QUERY),
        chain_type="stuff",
        retriever=retriever,
    )
    result = qa({'query': query})
    return result

Querying works exactly the same way as it did with the Chroma vector store, thanks to the Langchain abstraction.
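One small detail worth guarding: Pinecone index names must be lowercase (letters, digits, and hyphens), which is why the code above calls channel_name.lower(). As a minimal sketch, a hypothetical helper (not part of the original script) could normalise arbitrary channel names before create_index:

```python
import re

def to_index_name(channel_name: str) -> str:
    # Lowercase the name, replace any disallowed characters with hyphens,
    # and trim leading/trailing hyphens so Pinecone accepts the result.
    name = re.sub(r"[^a-z0-9-]+", "-", channel_name.lower())
    return name.strip("-")
```

For a simple name like "Bankless", this is equivalent to .lower(), but it also survives channel names containing spaces or punctuation.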

Creating the Streamlit UI

Now that the data from the YouTube channels is stored in the cloud, we can build a small UI with Streamlit, which has become a popular choice in the AI community for quick prototypes. Here are the essential parts of the code to initialise Streamlit:

import streamlit as st
from streamlit_chat import message

st.set_page_config(page_title="Bankless Youtube AI Query Demo", page_icon=":robot:")
st.header("Bankless Youtube AI Query Demo")

def init_session_state(channel_name):
    if "history" not in st.session_state:
        st.session_state["history"] = []

    if "generated" not in st.session_state:
        st.session_state["generated"] = ["Hey There - Ask me anything about " + channel_name]

    if "past" not in st.session_state:
        st.session_state["past"] = ["Hey!"]
        
channel_name = "Bankless"
init_session_state(channel_name)
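Since st.session_state behaves like a dictionary, the same initialisation pattern can be sketched with dict.setdefault — shown here on a plain dict so it runs outside Streamlit (the function name init_state is my own, not from the app):

```python
def init_state(state: dict, channel_name: str) -> dict:
    # Only set a key if it is not already present, mirroring the
    # "if key not in st.session_state" checks in the Streamlit app.
    state.setdefault("history", [])
    state.setdefault("generated", [f"Hey There - Ask me anything about {channel_name}"])
    state.setdefault("past", ["Hey!"])
    return state
```

The setdefault calls make the function safe to run on every rerun of the script, which is exactly what Streamlit does on each interaction.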

And here's how we handle the user input:

user_input   = None

with st.form(key='my_form', clear_on_submit=True):
    user_input    = st.text_input("Query:", placeholder="describe the concept of the network state", key='input')
    submit_button = st.form_submit_button(label='Send')

if submit_button and user_input:
    output = query_index(index, user_input)['result']
    st.session_state['history'].append(output)
    st.session_state['past'].append(user_input)
    st.session_state['generated'].append(output)

if st.session_state['generated']:
    for i in range(len(st.session_state['generated'])):
        message(st.session_state["past"][i], is_user=True, key=str(i) + '_user', avatar_style="big-smile")
        message(st.session_state["generated"][i], key=str(i), avatar_style="thumbs")
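The render loop above pairs each entry in "past" with the entry in "generated" at the same position. Stripped of Streamlit, the pairing logic is just a zip (chat_turns is a hypothetical name for illustration):

```python
def chat_turns(past: list, generated: list) -> list:
    # One (user, bot) tuple per exchange, oldest first; the seeded
    # greeting in `generated` lines up with the "Hey!" in `past`.
    return list(zip(past, generated))
```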

... And here is the finished demo running on Streamlit 🎉

Youtube AI Demo

Built on the AI tech stack of Langchain, Pinecone and Streamlit, this demo allows you to query the Bankless crypto YouTube channel and get quick answers based on the video content.

Check out the demo now

The most interesting thing I learned during this is how accessible all of this AI power is: the whole thing is roughly 100 lines of Python code! Anyone with a little Python knowledge and some perseverance can probably build this.
