Generative AI Basics to Advanced level
Generative AI with RAG: Data Preparation from Scratch

📋 Lesson Notes & Resources

Generative AI with RAG: Data Preparation from Scratch
What is Generative AI?
Generative AI refers to AI models that can generate new content such as text, images, or code. These models learn patterns from existing data and produce meaningful outputs.
What is RAG?
Retrieval-Augmented Generation (RAG) combines retrieval and generation. It first retrieves relevant data from a knowledge base and then passes that data to a language model, which generates an answer grounded in it.
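The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not a production design: the knowledge base, the word-overlap scoring, and the function names `retrieve` and `build_prompt` are all assumptions made for this example. A real system would typically rank by embedding similarity and send the prompt to a language model.

```python
import re

# Hypothetical in-memory knowledge base; a real system would use a vector store.
KNOWLEDGE_BASE = [
    "Python is a programming language.",
    "RAG combines retrieval with text generation.",
    "Embeddings map text to numeric vectors.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to a language model."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


context = retrieve("What is Python?", KNOWLEDGE_BASE)
prompt = build_prompt("What is Python?", context)
print(prompt)
```

The generation step is left out on purpose: `prompt` is what a language model would receive, so the quality of the retrieved context directly limits the quality of the answer.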
Why is Data Preparation Important?
Data preparation ensures that the input data is clean, structured, and meaningful before it is embedded and retrieved. Poorly prepared data leads to irrelevant retrieval results and, in turn, to poor answers.
Steps in Data Preparation
1. Collect raw data (CSV, PDFs, APIs)
2. Clean data (remove noise, duplicates)
3. Convert to structured format (JSON)
4. Chunk data into smaller pieces
5. Add metadata (source, category)
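The five steps above can be sketched as one small pipeline. This is a minimal sketch under stated assumptions: the sample records, the field names, and the helper names (`clean`, `chunk_words`, `prepare`) are invented for illustration, and the chunk size of 5 words is chosen only to keep the output short.

```python
import re

# Step 1: collect raw data. These records are made-up stand-ins for rows
# read from a CSV, text extracted from a PDF, or an API response.
raw_records = [
    {"source": "faq.csv", "category": "python",
     "text": "<p>Python   is a programming language.</p>"},
    {"source": "faq.csv", "category": "python",
     "text": "<p>Python is a programming language.</p>"},
    {"source": "notes.pdf", "category": "rag",
     "text": "RAG retrieves relevant data, then generates an answer."},
]


def clean(text: str) -> str:
    """Step 2 (noise): strip HTML tags and collapse repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int) -> list[str]:
    """Step 4: split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def prepare(records: list[dict], chunk_size: int = 5) -> list[dict]:
    """Run cleaning, deduplication, chunking, and metadata tagging."""
    seen = set()
    prepared = []
    for record in records:
        text = clean(record["text"])
        if text in seen:  # Step 2 (duplicates): skip text already seen
            continue
        seen.add(text)
        for index, chunk in enumerate(chunk_words(text, chunk_size)):
            # Steps 3 and 5: a structured record with metadata attached
            prepared.append({
                "text": chunk,
                "source": record["source"],
                "category": record["category"],
                "chunk_index": index,
            })
    return prepared


chunks = prepare(raw_records)
for chunk in chunks:
    print(chunk)
```

Deduplication runs after cleaning so that two records differing only in markup or spacing count as the same text.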
Example: CSV to JSON
{ "question": "What is Python?", "answer": "Python is a programming language" }
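One way to produce such a record is with Python's standard `csv` and `json` modules. The sketch below assumes a CSV with a header row containing `question` and `answer` columns; the sample content and the function name `csv_to_json` are made up for the example.

```python
import csv
import io
import json

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = "question,answer\nWhat is Python?,Python is a programming language\n"


def csv_to_json(text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


json_text = csv_to_json(csv_text)
print(json_text)
```

Each CSV row becomes one JSON object whose keys are taken from the header row.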
Final Output
Prepared data is now ready for embedding, vector storage, and search in a RAG system.

🚀 RAG Data Preparation Techniques

| # | Technique | What it Means | Why it is Important | Example |
|---|-----------|---------------|---------------------|---------|
| 1 | Text Cleaning | Remove HTML, symbols, noise | Improves embedding quality | <p>Hello</p> → Hello |
| 2 | Lowercasing | Convert text to lowercase | Avoid duplicate embeddings | Python = python |
| 3 | Remove Stop Words (Optional) | Remove common words | Reduces noise | is a language → language |
| 4 | Remove Special Characters | Clean unwanted symbols | Cleaner meaning | @@Python!! → Python |
| 5 | Deduplication | Remove duplicate content | Avoid repeated results | Same paragraph twice |
| 6 | Sentence Boundary Detection | Split into sentences | Keeps meaning intact | NLP sentence splitting |
| 7 | Chunking (Fixed Size) | Equal size chunks | Core of RAG | 300 words per chunk |
| 8 | Chunking (Semantic) 🔥 | Split by meaning | Better context | Split by topic |
| 9 | Overlapping Chunks | Add overlap | Avoid context loss | 1–100, 80–180 |
| 10 | Context Enrichment | Add missing context | Self-contained chunks | Python is… |
| 11 | Metadata Tagging | Add extra info | Helps filtering | {"topic":"AI"} |
| 12 | Document Structuring | Preserve sections | Improves understanding | Title + paragraph |
| 13 | Token Size Optimization | Limit tokens | Prevent errors | 200–500 tokens |
| 14 | Noise Removal | Remove junk | Improves relevance | Remove ads |
| 15 | Language Normalization | Standardize text | Better embeddings | Fix grammar |
| 16 | Keyword Extraction | Identify key terms | Hybrid search | Python, list |
| 17 | Named Entity Recognition | Extract entities | Better accuracy | AWS, Lambda |
| 18 | Embedding Generation | Text → vector | Core of search | OpenAI embeddings |
| 19 | Consistent Embedding Model | Same model | Prevent mismatch | Same model everywhere |
| 20 | Data Filtering | Remove irrelevant | Improve precision | Drop junk docs |
| 21 | Hierarchical Chunking | Parent-child chunks | Advanced retrieval | Section → Subsection |
| 22 | Sliding Window Chunking | Moving chunks | Maintain continuity | Shift by 50 words |
| 23 | Semantic Compression | Reduce text | Save cost | Summarized chunk |
| 24 | Query-Aware Chunking | Based on queries | Better relevance | FAQ chunks |
| 25 | Data Augmentation | Add examples | Better coverage | Q&A pairs |
| 26 | Title Injection | Add titles | Better context | Python Basics |
| 27 | Source Attribution | Store source | Trust & UI | pdf1 |
| 28 | Format Standardization | Consistent schema | Easier processing | JSON format |
| 29 | Multilingual Handling | Handle languages | Better retrieval | EN vs Telugu |
| 30 | Index Optimization | Optimize index | Faster search | Mappings |
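Fixed-size, overlapping, and sliding-window chunking (techniques 7, 9, and 22) can all be seen as one function with two parameters: the window size and the step between window starts. The sketch below is an illustration under assumptions: the function name `sliding_window_chunks` and the placeholder words are invented, and a size of 100 with a step of 80 is chosen to give a 20-word overlap close to the "1–100, 80–180" example in the table.

```python
def sliding_window_chunks(words: list[str], size: int, step: int) -> list[list[str]]:
    """Return windows of `size` words, each starting `step` words after the last.

    Consecutive windows overlap by `size - step` words when step < size.
    With step == size the result is plain fixed-size chunking.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks


# Made-up input: 180 placeholder words, w1 .. w180.
words = [f"w{i}" for i in range(1, 181)]
chunks = sliding_window_chunks(words, size=100, step=80)
spans = [(c[0], c[-1]) for c in chunks]
print(spans)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.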
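Hierarchical chunking, title injection, and source attribution (techniques 21, 26, and 27) fit together naturally: each section becomes a parent chunk, each paragraph becomes a child chunk that points to its parent, and the document title is prepended so every chunk is self-contained. The sketch below is only one possible layout. The document structure, the id format, and the field names are assumptions made for this example.

```python
# Hypothetical document: a title, a source label, and sections of paragraphs.
document = {
    "title": "Python Basics",
    "source": "pdf1",
    "sections": [
        {"heading": "Lists",
         "paragraphs": ["A list stores ordered items.", "Lists are mutable."]},
        {"heading": "Tuples",
         "paragraphs": ["A tuple is immutable."]},
    ],
}


def hierarchical_chunks(doc: dict) -> list[dict]:
    """Build parent chunks per section and child chunks per paragraph."""
    chunks = []
    for s_index, section in enumerate(doc["sections"]):
        parent_id = f"{doc['source']}-s{s_index}"
        prefix = f"{doc['title']} > {section['heading']}"  # injected title
        # Parent chunk: the whole section under its injected title.
        chunks.append({
            "id": parent_id,
            "parent_id": None,
            "text": f"{prefix}: " + " ".join(section["paragraphs"]),
            "source": doc["source"],
        })
        # Child chunks: one per paragraph, linked back to the parent.
        for p_index, paragraph in enumerate(section["paragraphs"]):
            chunks.append({
                "id": f"{parent_id}-p{p_index}",
                "parent_id": parent_id,
                "text": f"{prefix}: {paragraph}",
                "source": doc["source"],
            })
    return chunks


chunks = hierarchical_chunks(document)
for chunk in chunks:
    print(chunk["id"], "|", chunk["text"])
```

A retriever can then match on the small child chunks and use `parent_id` to fetch the surrounding section when more context is needed.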
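Stop-word removal and keyword extraction (techniques 3 and 16) can be approximated with simple word counts. This is a rough sketch, assuming a tiny hand-written stop-word list and frequency as the only ranking signal; the names `STOP_WORDS` and `extract_keywords` are invented here, and real pipelines usually rely on a fuller list or a dedicated NLP library.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Return the most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


chunk = "A Python list is ordered. A Python list is mutable, and the list can grow."
keywords = extract_keywords(chunk, top_n=2)
print(keywords)
```

Keywords extracted this way can be stored next to each chunk as metadata, so a keyword search can run alongside the vector search.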