Generative AI with RAG: Data Preparation from Scratch

Generative AI Basics to Advanced level

What is Generative AI?

Generative AI refers to AI models that can generate new content such as text, images, or code. These models learn patterns from existing data and produce meaningful outputs.

What is RAG?

Retrieval-Augmented Generation (RAG) combines retrieval and generation. It first retrieves relevant passages from a knowledge base, then passes them to a language model as context so that the generated answer is grounded in that data.
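
The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not a production retriever: the knowledge base, the word-overlap scoring, and the function names tokenize, retrieve, and build_prompt are all assumptions made for this example. A real system would score with embeddings and send the prompt to a language model.

```python
import re

# Hypothetical three-document knowledge base, invented for this sketch.
KNOWLEDGE_BASE = [
    "Python is a programming language used for scripting and data work.",
    "RAG retrieves relevant text before a language model writes an answer.",
    "Embeddings map text to vectors so similar passages sit close together.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Place the retrieved chunks ahead of the question for the model."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "What is Python?"
prompt = build_prompt(question, retrieve(question, KNOWLEDGE_BASE))
print(prompt)
```

The generation step is deliberately left out: the printed prompt is what would be handed to whichever language model the system uses.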

Why is Data Preparation Important?

Data preparation ensures that the input data is clean, structured, and meaningful for embedding and retrieval. Poorly prepared data produces poor embeddings, so retrieval returns irrelevant chunks and the generated answers suffer.

Steps in Data Preparation

1. Collect raw data (CSV, PDFs, APIs)
2. Clean data (remove noise, duplicates)
3. Convert to structured format (JSON)
4. Chunk data into smaller pieces
5. Add metadata (source, category)
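
The steps above can be sketched as a small pipeline. This is a minimal sketch using only the standard library. The helper names (clean_text, deduplicate, chunk_words, prepare), the 50-word chunk size, and the sample values faq.csv and programming are assumptions for illustration, not part of the lesson.

```python
import html
import json
import re

def clean_text(raw):
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()

def deduplicate(texts):
    """Drop empty strings and exact repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for text in texts:
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

def chunk_words(text, size=50):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def prepare(raw_docs, source, category):
    """Run clean -> dedupe -> chunk -> tag and return JSON-ready records."""
    cleaned = deduplicate([clean_text(doc) for doc in raw_docs])
    records = []
    for text in cleaned:
        for chunk in chunk_words(text):
            records.append({"text": chunk, "source": source, "category": category})
    return records

# Hypothetical raw input: one HTML snippet, one duplicate, one empty string.
raw = ["<p>Python is a programming   language</p>", "Python is a programming language", ""]
records = prepare(raw, source="faq.csv", category="programming")
print(json.dumps(records, indent=2))
```

After cleaning, the first two inputs are identical, so the three raw items collapse into a single tagged record.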

Example: CSV to JSON

For example, a CSV row with question and answer columns becomes one JSON object:

{ "question": "What is Python?", "answer": "Python is a programming language" }
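
A conversion like this can be done with Python's built-in csv and json modules. The sketch below assumes a well-formed CSV with a header row. The sample CSV text and the csv_to_records helper are made up for the example, and a real pipeline would read from a file rather than a string.

```python
import csv
import io
import json

# Hypothetical CSV content; in practice this would come from a file on disk.
CSV_TEXT = """question,answer
What is Python?,Python is a programming language
What is RAG?,Retrieval plus generation
"""

def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: (value or "").strip() for key, value in row.items()}
        for row in reader
    ]

records = csv_to_records(CSV_TEXT)
print(json.dumps(records[0]))
```

Each row becomes one dictionary, so the resulting list can be written out with json.dump or passed straight to the chunking step.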

Final Output

Prepared data is now ready for embedding, vector storage, and search in a RAG system.

🚀 RAG Data Preparation Techniques

#	Technique	What it Means	Why it is Important	Example
1	Text Cleaning	Remove HTML, symbols, noise	Improves embedding quality	<p>Hello</p> → Hello
2	Lowercasing	Convert text to lowercase	Avoid duplicate embeddings	Python = python
3	Remove Stop Words (Optional)	Remove common words	Reduces noise	is a language → language
4	Remove Special Characters	Clean unwanted symbols	Cleaner meaning	@@Python!! → Python
5	Deduplication	Remove duplicate content	Avoid repeated results	Same paragraph twice
6	Sentence Boundary Detection	Split into sentences	Keeps meaning intact	NLP sentence splitting
7	Chunking (Fixed Size)	Equal size chunks	Keeps chunk size predictable	300 words per chunk
8	Chunking (Semantic) 🔥	Split by meaning	Better context	Split by topic
9	Overlapping Chunks	Add overlap between neighboring chunks	Avoid context loss at chunk boundaries	Chunk 1: 1–100, chunk 2: 80–180
10	Context Enrichment	Add missing context	Self-contained chunks	Python is…
11	Metadata Tagging	Add extra info	Helps filtering	{"topic":"AI"}
12	Document Structuring	Preserve sections	Improves understanding	Title + paragraph
13	Token Size Optimization	Limit tokens per chunk	Stays within the embedding model's input limit	200–500 tokens
14	Noise Removal	Remove junk	Improves relevance	Remove ads
15	Language Normalization	Standardize text	Better embeddings	Fix grammar
16	Keyword Extraction	Identify key terms	Hybrid search	Python, list
17	Named Entity Recognition	Extract entities	Better accuracy	AWS, Lambda
18	Embedding Generation	Text → vector	Core of search	OpenAI embeddings
19	Consistent Embedding Model	Same model	Prevent mismatch	Same model everywhere
20	Data Filtering	Remove irrelevant	Improve precision	Drop junk docs
21	Hierarchical Chunking	Parent-child chunks	Advanced retrieval	Section → Subsection
22	Sliding Window Chunking	Moving chunks	Maintain continuity	Shift by 50 words
23	Semantic Compression	Reduce text	Save cost	Summarized chunk
24	Query-Aware Chunking	Based on queries	Better relevance	FAQ chunks
25	Data Augmentation	Add examples	Better coverage	Q&A pairs
26	Title Injection	Add titles	Better context	Python Basics
27	Source Attribution	Store source	Trust & UI	pdf1
28	Format Standardization	Consistent schema	Easier processing	JSON format
29	Multilingual Handling	Handle languages	Better retrieval	EN vs Telugu
30	Index Optimization	Optimize index	Faster search	Mappings
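
Rows 7, 9, and 22 of the table (fixed-size, overlapping, and sliding-window chunking) can be illustrated together. The sketch below counts words rather than tokens for simplicity. The function name sliding_window_chunks, the 100-word size, and the 20-word overlap are illustrative choices, not values prescribed by the lesson.

```python
def sliding_window_chunks(text, size=100, overlap=20):
    """Split text into chunks of up to `size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks

# Hypothetical input: 180 numbered words so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(1, 181))
chunks = sliding_window_chunks(sample, size=100, overlap=20)
for chunk in chunks:
    chunk_words = chunk.split()
    print(chunk_words[0], "...", chunk_words[-1], f"({len(chunk_words)} words)")
```

Setting overlap to 0 gives plain fixed-size chunking, and a smaller step (a larger overlap) gives a denser sliding window at the cost of storing more chunks.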