Building with ChatGPT - Part 3 - Using Langchain
Using Langchain and ChatGPT to analyse and query YouTube channels.
In part 3 of this series we are playing around with the Langchain library in order to figure out whether it might be useful for the startup advisor app we are trying to build. With the experiment we want to:
- Let ChatGPT ingest a YouTube channel
- Ask ChatGPT questions about the contents of the videos
As a specific example, we want to ask ChatGPT "what is the network state?" based on the videos uploaded to the crypto channel Bankless.
Getting the Data from YouTube
The first step for our app is getting the data from YouTube. Specifically, we want to get the subtitles of each video that has been uploaded to the channel. Since we are experimenting right now, we will save the output to a data file that we can reuse in the next steps, because fetching all the data from the YouTube API may take quite a while, depending on channel size.
Langchain provides a module called "Document Loaders", which creates "documents" from various data sources like Notion, Google Drive or YouTube. Usage is pretty simple and looks like this:
from pathlib import Path
import pickle
from langchain.document_loaders import GoogleApiClient, GoogleApiYoutubeLoader

def initialise_yt_data():
    # Fetch the captions of every video on the channel and cache them on disk.
    google_api_client = GoogleApiClient(token_path=Path("token.json"))
    youtube_loader_channel = GoogleApiYoutubeLoader(
        google_api_client=google_api_client,
        channel_name="Bankless",
        captions_language="en",
        continue_on_failure=True,
    )
    documents = youtube_loader_channel.load()
    with open('yt.data', 'wb') as filehandle:
        pickle.dump(documents, filehandle)
    return documents

def load_youtube_data():
    # Reload the cached documents without calling the YouTube API again.
    with open('yt.data', 'rb') as filehandle:
        documents = pickle.load(filehandle)
    return documents
Before your first run of the script, you will need to create an app in Google Cloud; this allows the script to access the YouTube API. On the first run of the code, a web browser will open in which you have to confirm that the script is allowed to access the API (the access token is then stored on disk).
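The two functions above follow a "fetch once, then load from disk" pattern. As a minimal sketch of how that could be folded into one helper, the following uses only the standard library; the name `load_or_build` and its parameters are hypothetical and not part of Langchain:

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(path: str, build: Callable[[], Any]) -> Any:
    """Return the pickled object stored at `path`.

    If the file does not exist yet, call `build()` once, pickle its
    result to `path`, and return it. `build` is a hypothetical stand-in
    for an expensive step such as fetching captions from the YouTube API.
    """
    if os.path.exists(path):
        with open(path, "rb") as filehandle:
            return pickle.load(filehandle)
    data = build()
    with open(path, "wb") as filehandle:
        pickle.dump(data, filehandle)
    return data
```

With such a helper, the slow API call would only run when `yt.data` is missing, for example by passing a function that wraps the loader's `load()` call as `build`. Note that pickle files should only be loaded if you created them yourself.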
Splitting the Text & Adding the Text to Chroma
The next step is to split the text into smaller chunks before adding these chunks to a vector store. Splitting into chunks can be complicated since we generally don't want to split at arbitrary positions, but for a simple proof of concept we will use the RecursiveCharacterTextSplitter. When we pass these chunks along as context to ChatGPT, we need to make sure that the maximum token size is not exceeded. This is why we create the splitter with from_tiktoken_encoder, which measures chunk length in tokens (using OpenAI's tiktoken tokenizer) rather than in characters. Additionally, we set up a Chroma database and embed our split documents in it:
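from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Assumed values: the snippet uses these names without defining them.
persist_directory = "db"
embeddings = OpenAIEmbeddings()

def initialise_db(documents):
    # Measure chunk size in tokens (via tiktoken) rather than in characters.
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=512, chunk_overlap=20)
    texts = text_splitter.split_documents(documents)
    # Embed the chunks and store them in a Chroma database on disk.
    db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)
    db.persist()
    return db

def load_db():
    # Reopen the persisted database without re-embedding anything.
    db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    return db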
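To make the roles of chunk_size and chunk_overlap more concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter recursively splits on separators and merges the pieces); the function name `split_with_overlap` is hypothetical, and "tokens" here are just list items:

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Cut `tokens` into windows of at most `chunk_size` items.

    Consecutive windows share `chunk_overlap` items, so text that sits on
    a chunk boundary still appears with some surrounding context.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With chunk_size=512 and chunk_overlap=20 as in the snippet above, each new window would start 492 tokens after the previous one.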
Setting Up the Retrieval Chain & Querying
The final step before sending out our query is to set up the retrieval chain. We are using a RetrievalQA chain, which allows us to do question answering over an index, and then send the query (make sure that the OPENAI_API_KEY environment variable is set):
from langchain.chains import RetrievalQA

# Assumes the database has already been built with initialise_db.
db = load_db()
retriever = db.as_retriever()
# chain_type="stuff" puts all retrieved chunks into a single prompt.
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
query = "What is the network state?"
result = qa({'query': query})
print(result)
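Conceptually, the chain above does two things: it retrieves the chunks most similar to the query, and it "stuffs" them into one prompt for the model. The following is a toy sketch of that idea, not Langchain's implementation: the bag-of-words `embed` function is a stand-in for real embeddings, and the prompt wording and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def stuff_prompt(query: str, context_chunks: List[str]) -> str:
    """Put all retrieved chunks into a single prompt (hypothetical wording)."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Because every retrieved chunk ends up in the same prompt, the number of chunks times the chunk size has to stay below the model's context limit, which is why the chunk size was measured in tokens earlier.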


So is this useful for the startup chat advisor app? Probably yes, because it allows us to add more context to the conversations we are having with the app. If all the documentation for the startup is stored in Notion and we can give ChatGPT this information as additional context, the resulting conversations with the chatbot will be better.