YouTube Transcripts for RAG Pipelines — Chunked, Metadata-Rich, Ready to Embed
Raw YouTube transcripts are not RAG-ready. YouTube returns transcripts as 2–5 second segments — fragments of roughly 8–20 tokens each. Embedding models work best with 200–400 tokens of coherent text (Vectara NAACL 2025, NVIDIA benchmark, Chroma Research, Microsoft Azure AI Search). Feed them 15-token fragments and your retrieval quality degrades immediately: queries can't match context that's been cut into arbitrary pieces, and there's no metadata to filter by video, channel, or timestamp.
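For a sense of scale, a raw transcript arrives as something like this (field names and values illustrative; the real list runs to hundreds of entries for a 19-minute video):

```python
raw_segments = [
    {"text": "This is a 3.", "start": 4.43, "duration": 2.1},
    {"text": "It's sloppily written and rendered at an", "start": 6.53, "duration": 2.8},
    {"text": "extremely low resolution of 28x28 pixels,", "start": 9.33, "duration": 2.9},
    # ...and so on: each entry covers only a few seconds of speech
    # and carries no video-level metadata.
]
```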
Every developer building a YouTube-based RAG pipeline hits this problem and solves it manually: merge segments, pick a chunk size, handle overlap, attach metadata, format for the vector database. INDXR.AI's RAG JSON export does that in one click.
What the Output Actually Looks Like
Here's a real chunk from a 3Blue1Brown neural networks video (19 min, AssemblyAI transcription, 60s preset):
```json
{
  "metadata": {
    "video_id": "aircAruvnKk",
    "title": "But what is a neural network? | Deep learning chapter 1",
    "duration_seconds": 1119,
    "extraction_method": "assemblyai",
    "extracted_at": "2026-04-23T18:55:35.850Z",
    "chunking_config": {
      "chunk_size_seconds": 60,
      "overlap_seconds": 9,
      "overlap_strategy": "sentence_boundary",
      "total_chunks": 18
    }
  },
  "chunks": [
    {
      "chunk_index": 0,
      "chunk_id": "aircAruvnKk_chunk_000",
      "text": "This is a 3. It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain has no trouble recognizing it as a 3. And I want you to take a moment to appreciate how crazy it is that brains can do this so effortlessly...",
      "start_time": 4.434,
      "end_time": 67.98,
      "deep_link": "https://youtu.be/aircAruvnKk?t=4",
      "token_count_estimate": 251,
      "metadata": {
        "video_id": "aircAruvnKk",
        "title": "But what is a neural network? | Deep learning chapter 1",
        "chunk_index": 0,
        "total_chunks": 18,
        "start_time": 4.434,
        "end_time": 67.98,
        "language": null
      }
    }
  ]
}
```

A few things worth noting directly.
deep_link is pre-constructed per chunk. Click it and you land on the exact second the chunk starts in the video. When your LLM cites a source, it can link to the moment, not just the video page.
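That means a chunk can be turned into a timestamped citation without any extra lookup. A minimal sketch (the cite helper is illustrative, not part of the export):

```python
import json

with open("transcript_rag.json") as f:
    data = json.load(f)

def cite(chunk: dict) -> str:
    # Illustrative helper: title, mm:ss offset, and the ready-made deep link.
    minutes, seconds = divmod(int(chunk["start_time"]), 60)
    return f'{chunk["metadata"]["title"]} ({minutes}:{seconds:02d}) {chunk["deep_link"]}'

# "But what is a neural network? | Deep learning chapter 1 (0:04) https://youtu.be/aircAruvnKk?t=4"
print(cite(data["chunks"][0]))
```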
metadata is flat. Vector databases require scalar key-value pairs — no nested objects. The structure here loads directly into Pinecone, ChromaDB, Weaviate, and Qdrant without transformation.
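As a concrete example, here is a minimal sketch loading the chunks into ChromaDB and filtering retrieval by video_id (collection name and query text are placeholders):

```python
import json
import chromadb

with open("transcript_rag.json") as f:
    data = json.load(f)

client = chromadb.Client()
collection = client.get_or_create_collection("youtube_transcripts")

# Per-chunk metadata is already flat scalars; null values (e.g. a null
# language) are dropped here because Chroma stores str/int/float/bool only.
collection.add(
    ids=[c["chunk_id"] for c in data["chunks"]],
    documents=[c["text"] for c in data["chunks"]],
    metadatas=[
        {k: v for k, v in c["metadata"].items() if v is not None}
        for c in data["chunks"]
    ],
)

# Scope retrieval to a single video via the flat video_id key.
results = collection.query(
    query_texts=["how does the network recognize digits?"],
    n_results=3,
    where={"video_id": "aircAruvnKk"},
)
```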
token_count_estimate uses the cl100k_base approximation (~1.33 tokens per word). It lets you verify chunks fit your embedding model's context window without running a tokenizer yourself.
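To verify against the actual tokenizer rather than the estimate, a few lines of tiktoken are enough:

```python
import json
import tiktoken

with open("transcript_rag.json") as f:
    data = json.load(f)

enc = tiktoken.get_encoding("cl100k_base")
for chunk in data["chunks"]:
    exact = len(enc.encode(chunk["text"]))
    print(f'{chunk["chunk_id"]}: estimate={chunk["token_count_estimate"]}, exact={exact}')
```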
overlap_strategy tells you how the overlap was computed. For AssemblyAI transcripts with punctuation, we use sentence-boundary detection — the overlap ends on a complete sentence. For auto-caption transcripts without punctuation, we use segment-boundary overlap instead.
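In sketch form, sentence-boundary overlap amounts to taking the tail of the previous chunk and trimming it back so it contains only complete sentences. The function below is a simplified illustration of the idea (not the production chunker), with an assumed speaking rate to convert seconds to words:

```python
import re

def sentence_boundary_overlap(prev_chunk_text: str, overlap_seconds: float,
                              words_per_second: float = 2.5) -> str:
    """Illustrative only: take roughly overlap_seconds of words from the end
    of the previous chunk, then keep only complete sentences."""
    words = prev_chunk_text.split()
    tail_words = max(1, int(overlap_seconds * words_per_second))  # assumed speech rate
    tail = " ".join(words[-tail_words:])
    # Drop any partial leading sentence. Auto-caption text has no punctuation,
    # so nothing matches here and segment-boundary overlap is used instead.
    boundary = re.search(r"[.!?]\s+", tail)
    return tail[boundary.end():] if boundary else tail
```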
Chunk Size Options
Four presets, configurable in Settings → Developer Exports:
| Preset | Duration | ~Tokens | Best for |
|---|---|---|---|
| Quote | 30s | ~100 | Short-form content, granular retrieval |
| Balanced | 60s | ~200 | Default — works across most use cases |
| Precise | 90s | ~300 | Inside the research-backed sweet spot |
| Context | 120s | ~400 | Lectures, long-form analysis |
The 60s default balances retrieval granularity with semantic completeness. For lecture content like the Karpathy GPT video (1h56m), 90s produced 89 chunks with ~400 tokens each — the range that performs best for analytical queries according to NVIDIA's 2024 benchmark.
Loading into LangChain
Each chunk maps directly to LangChain's Document schema:
```python
import json

from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load the exported RAG JSON
with open("transcript_rag.json") as f:
    data = json.load(f)

# Each chunk maps 1:1 to a Document: text -> page_content, flat metadata -> metadata
documents = [
    Document(
        page_content=chunk["text"],
        metadata=chunk["metadata"]
    )
    for chunk in data["chunks"]
]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents, embeddings)

results = vectorstore.similarity_search(
    "What is the core challenge with raw transcripts?",
    k=3
)
for doc in results:
    print(f"[{doc.metadata['start_time']}s] {doc.page_content[:200]}")
```

Loading into Pinecone
```python
import json

from openai import OpenAI
from pinecone import Pinecone

with open("transcript_rag.json") as f:
    data = json.load(f)

client = OpenAI()
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("youtube-transcripts")

# Embed each chunk and pair it with its chunk_id and flat metadata
vectors = []
for chunk in data["chunks"]:
    embedding = client.embeddings.create(
        input=chunk["text"],
        model="text-embedding-3-small"
    ).data[0].embedding
    vectors.append({
        "id": chunk["chunk_id"],
        "values": embedding,
        "metadata": chunk["metadata"]
    })

# Upsert in batches of 100
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i:i+100])
```

Auto-Captions vs. AI Transcription for RAG
The difference matters more for RAG than for any other use case.
Auto-captions lack punctuation. Text arrives as lowercase words without sentence boundaries. When the chunker tries to detect where sentences end for overlap computation, it can't — so it falls back to segment-boundary overlap instead. The chunks still work, but the overlap is less semantically clean.
Auto-captions are also less accurate than AssemblyAI, particularly for accents, domain vocabulary, and fast speech. Errors propagate into your embeddings.
For RAG pipelines where retrieval quality matters, use AI Transcription. The resulting chunks have proper sentence boundaries, accurate text, and sentence-level overlap. For a 19-minute video, AI Transcription costs 19 credits — roughly €0.23 at Basic pricing.
One specific case where auto-captions are fine: if your downstream pipeline does its own text cleaning and doesn't rely on sentence boundaries for chunking decisions.
Pricing
RAG JSON export: 1 credit per 15 minutes of video, minimum 1.
| Video length | Credits |
|---|---|
| 0–15 min | 1 credit |
| 16–30 min | 2 credits |
| 31–45 min | 3 credits |
| 46–60 min | 4 credits |
| 1h56min (Karpathy GPT) | 8 credits |
| 2h49min (Joe Rogan Snowden) | 12 credits |
First 3 exports free. Credits never expire.
For the standard (non-chunked) JSON format, see YouTube Transcript JSON Export. For a deep dive into chunk size research and overlap strategy, see How to Chunk YouTube Transcripts for RAG. For credit packages, see the pricing page.
Frequently Asked Questions
- Does this work for playlists?
- Yes. Extract a playlist and every video gets its own RAG JSON file.
- Does this work for audio I upload myself?
- Yes. Upload any audio file via the Audio tab. The output is identical — channel and language will be null since there's no YouTube metadata.
- Can I change the chunk size after export?
- Yes. Set your preferred default in Settings → Developer Exports. You can re-export any saved transcript with a different preset — no re-transcription needed.
- What embedding model should I use?
- OpenAI text-embedding-3-small is a practical default for the 200–400 token range our chunks produce. Cohere embed-english-v3.0 and Voyage AI voyage-3 are strong alternatives.
Sources
- Vectara NAACL 2025 — Chunking strategy benchmark (25 configs × 48 embedding models)
- NVIDIA Technical Blog — Finding the Best Chunking Strategy for Accurate AI Responses
- Chroma Research — Evaluating Chunking Strategies for Retrieval
- Microsoft Azure AI Search — How to chunk documents for vector search
- LangChain — Document schema concepts
- Pinecone — Upsert data
- Weaviate — documentation
- Qdrant — documentation