YouTube Transcript JSON Export — What You Actually Get
If you've worked with YouTube transcript data programmatically, you know the frustration. The raw output from youtube-transcript-api, the most widely used library for the job, looks like this:
```json
[
  {"text": "everybody needs to learn to code", "start": 1.91, "duration": 2.1},
  {"text": "coding is the new literacy", "start": 4.01, "duration": 1.8}
]
```

No video title. No channel. No language. No end timestamp. Just fragments. You spend the next hour writing boilerplate to reconstruct what you actually need.
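That boilerplate is small but tedious. A minimal sketch of the reconstruction step, with the raw fragments inlined so it runs without a network call (in practice they would come from the library's fetch):

```python
# Raw fragments in the shape youtube-transcript-api returns
# (sample data inlined; normally fetched from the API).
raw = [
    {"text": "everybody needs to learn to code", "start": 1.91, "duration": 2.1},
    {"text": "coding is the new literacy", "start": 4.01, "duration": 1.8},
]

# Reconstruct the end timestamp each fragment is missing.
segments = [
    {
        "text": f["text"],
        "start_time": f["start"],
        "end_time": round(f["start"] + f["duration"], 2),
    }
    for f in raw
]
```

And that only covers timing: title, channel, and language still have to be fetched separately.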
INDXR.AI exports transcripts as structured JSON with the metadata already in place. Here's exactly what you get and what it costs — no features described that aren't actually in the output.
Standard JSON — Free for Captioned Videos
For any YouTube video with auto-generated captions, the standard JSON export is free.
Here's the actual output, taken from a real export of Fireship's How to Learn to Code (6.75 min):
```json
{
  "metadata": {
    "video_id": "NtfbWkxJTHw",
    "title": "How to Learn to Code - 8 Hard Truths",
    "channel": "Fireship",
    "language": "en",
    "published_at": "2022-02-09",
    "duration_seconds": 405,
    "extraction_method": "youtube_captions",
    "extracted_at": "2026-04-23T18:38:07.820Z"
  },
  "segments": [
    {
      "text": "everybody needs to learn to code coding is the new literacy",
      "start_time": 1.91,
      "end_time": 4.01
    },
    {
      "text": "if you can't code you'll soon become obsolete",
      "start_time": 4.01,
      "end_time": 6.32
    }
  ]
}
```

Every segment has start_time and end_time, calculated from the raw caption timing. The metadata wrapper includes the video title, channel, language, and publish date, extracted automatically from YouTube's data.
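Because the metadata travels with the segments, consuming the export takes a few lines. A minimal sketch, using an abbreviated copy of the export above as sample input:

```python
import json

# Abbreviated sample of the standard JSON export shown above.
export_json = """
{
  "metadata": {
    "video_id": "NtfbWkxJTHw",
    "title": "How to Learn to Code - 8 Hard Truths",
    "channel": "Fireship",
    "language": "en"
  },
  "segments": [
    {"text": "everybody needs to learn to code coding is the new literacy",
     "start_time": 1.91, "end_time": 4.01}
  ]
}
"""

export = json.loads(export_json)
print(export["metadata"]["title"])
for seg in export["segments"]:
    # Each segment already carries both timestamps.
    print(f'[{seg["start_time"]:7.2f} -> {seg["end_time"]:7.2f}] {seg["text"]}')
```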
The honest limitation with auto-captions: The text arrives as a stream of lowercase words with no punctuation. Notice "everybody needs to learn to code coding is the new literacy" — no capitalization, no period. This is a YouTube limitation, not ours. For most data processing purposes it's workable. For anything that presents text to users or needs sentence boundaries for downstream NLP, it's a meaningful quality gap.
For non-English videos: our caption extraction pipeline always receives YouTube's English translation, regardless of the video's original language. If you need the original Arabic, Turkish, Spanish, or Portuguese text, use AI Transcription instead; it transcribes the actual audio in the original language.
Cost: Free. No credits, no account required for a single video.
AI Transcription + Standard JSON — 1 Credit Per Minute
When you enable AI Transcription, INDXR.AI downloads the video audio and runs it through AssemblyAI Universal-3 Pro. The output format is identical — same metadata wrapper, same segments array — but the text quality changes substantially.
Here's what changes in the segments:
```json
{
  "segments": [
    {
      "text": "This is a 3. It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain has no trouble recognizing it as a 3.",
      "start_time": 4.434,
      "end_time": 10.315
    }
  ]
}
```

Proper capitalization. Proper punctuation. Sentence boundaries. This is from 3Blue1Brown's neural networks video, the same content that auto-captions would give you as an unpunctuated lowercase stream.
The difference matters for three specific situations:
First, AI Transcription works for videos without captions at all. Roughly 20% of YouTube videos have no auto-generated captions. For these, it's the only option.
Second, AssemblyAI is more accurate than YouTube auto-captions for English and other supported languages — particularly with accents, fast speech, and technical vocabulary.
Third, if you're building a RAG pipeline, punctuated text with sentence boundaries enables sentence-level chunking. Without punctuation, chunkers cut through sentences arbitrarily.
Cost: 1 credit per minute, minimum 1 credit.
| Video length | Credits | Cost at Basic (€6.99/500cr) | Cost at Plus (€13.99/1,200cr) |
|---|---|---|---|
| 10 min | 10 | €0.14 | €0.12 |
| 30 min | 30 | €0.42 | €0.35 |
| 1 hour | 60 | €0.84 | €0.70 |
| 2 hours | 120 | €1.68 | €1.40 |
RAG JSON — For Vector Databases and AI Pipelines
If you're loading transcripts into a vector database or building a retrieval-augmented pipeline, the standard JSON format isn't what you want. The 2–5 second segments are too small for embedding — each fragment contains roughly 8–20 tokens, far below the 200–400 token range where embedding models perform best (Vectara NAACL 2025, NVIDIA benchmark).
RAG JSON handles that transformation. It merges segments into configurable chunks (30s / 60s / 90s / 120s), adds sentence-boundary overlap, and attaches per-chunk metadata ready for direct vector database upsert.
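The merge step can be illustrated in a few lines. This is a simplified sketch, not INDXR.AI's exact algorithm (which also adds sentence-boundary overlap); it greedily accumulates standard-JSON segments until a chunk spans the target window. The 0.75-words-per-token ratio is a common rough heuristic, not the exact estimator used in the product:

```python
def merge_into_chunks(segments, window_seconds=90.0):
    """Greedily merge short caption segments into ~window_seconds chunks."""
    chunks, buf = [], []
    for seg in segments:
        buf.append(seg)
        if seg["end_time"] - buf[0]["start_time"] >= window_seconds:
            chunks.append(_flush(buf))
            buf = []
    if buf:  # flush the trailing partial chunk
        chunks.append(_flush(buf))
    return chunks

def _flush(buf):
    text = " ".join(s["text"] for s in buf)
    return {
        "text": text,
        "start_time": buf[0]["start_time"],
        "end_time": buf[-1]["end_time"],
        # Rough heuristic: ~0.75 words per token for English text.
        "token_count_estimate": round(len(text.split()) / 0.75),
    }
```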
Here's a real chunk from the Andrej Karpathy Let's build GPT video (1h56m, 90s preset):
```json
{
  "chunk_index": 0,
  "chunk_id": "kCc8FmEb1nY_chunk_000",
  "text": "hi everyone so by now you have probably heard of chat GPT it has taken the world and AI Community by storm...",
  "start_time": 2.51,
  "end_time": 93.439,
  "deep_link": "https://youtu.be/kCc8FmEb1nY?t=2",
  "token_count_estimate": 404,
  "metadata": {
    "video_id": "kCc8FmEb1nY",
    "title": "Let's build GPT: from scratch, in code, spelled out.",
    "channel": "Andrej Karpathy",
    "chunk_index": 0,
    "total_chunks": 89,
    "start_time": 2.51,
    "end_time": 93.439,
    "language": "en"
  }
}
```

The deep_link field is pre-constructed and points to the exact second the chunk starts. The metadata object is flat, which is the structure Pinecone and ChromaDB require for filtering.
What you should know: RAG JSON on auto-captions works, but the overlap strategy differs from AssemblyAI. Without punctuation, we can't detect sentence boundaries, so overlap uses segment-level alignment instead. The result is still useful, but AssemblyAI-sourced transcripts produce cleaner chunks with true sentence boundaries. This is reflected in the overlap_strategy field in the output: "sentence_boundary" for AssemblyAI, "segment_boundary" for auto-captions.
Cost: 1 credit per 15 minutes of video, minimum 1 credit.
| Video length | Credits |
|---|---|
| 0–15 min | 1 credit |
| 16–30 min | 2 credits |
| 31–60 min | 4 credits |
| 61–120 min | 8 credits |
| 2+ hours | 1 credit per 15 min |
The first 3 RAG exports are free. Credits never expire.
What You'd Add Yourself
The output doesn't include everything some pipelines want. Specifically: channel and language are not available for audio uploads (only YouTube video extraction), since those fields come from YouTube's metadata. If you need formatted timestamps ("00:01:32") rather than float seconds, construct them from start_time. If you need a YouTube deep link and you already have the video ID, it's https://youtu.be/{video_id}?t={Math.floor(start_time)} — the same formula we use.
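Both additions are a few lines in any language. A sketch in Python (Python's int() truncates toward zero, which matches Math.floor for the non-negative timestamps in these exports):

```python
def format_timestamp(seconds: float) -> str:
    """Float seconds -> "HH:MM:SS", e.g. 92.0 -> "00:01:32"."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def deep_link(video_id: str, start_time: float) -> str:
    """YouTube deep link to the second a segment or chunk starts."""
    return f"https://youtu.be/{video_id}?t={int(start_time)}"
```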
For the full RAG-optimized export with chunking, overlap configuration, and LangChain / Pinecone integration examples, see YouTube Transcripts for RAG Pipelines. For audio file uploads, see Audio Upload. For credit packages, see the pricing page.
Frequently Asked Questions
- Is standard JSON always free?
- Yes. Caption-based extraction is free, and the standard JSON export adds no credit cost on top. You pay only for AI Transcription (1 credit/min) and RAG JSON (1 credit/15 min).
- Does this work for audio files I upload myself?
- Yes. Upload MP3, MP4, WAV, M4A, OGG, FLAC, WEBM up to 500MB. Standard JSON and RAG JSON are both available after transcription. channel and language will be null since there's no YouTube metadata to fetch.
- What's the difference between standard JSON and RAG JSON?
- Standard JSON gives you 2–5 second segments — the raw caption timing. RAG JSON merges those into configurable chunks (30s–120s) with overlap, per-chunk deep links, token count estimates, and flat metadata. Standard JSON is a data format. RAG JSON is a pipeline-ready input.
- Does AI Transcription improve accuracy for English?
- Yes. AssemblyAI Universal-3 Pro outperforms YouTube auto-captions for accuracy, particularly with accents, fast speech, and domain-specific vocabulary. The bigger difference is punctuation — AssemblyAI adds it, auto-captions don't.
Sources
- Vectara NAACL 2025 — Chunking strategy benchmark (25 configs × 48 embedding models)
- NVIDIA Technical Blog — Finding the Best Chunking Strategy for Accurate AI Responses
- AssemblyAI Universal-3 Pro — speech-to-text model
- LangChain YoutubeLoader — document loader docs
- Pinecone — Filter with metadata
- ChromaDB — documentation