Generative AI with RAG: Data Preparation from Scratch
What is Generative AI?
Generative AI refers to AI models that can generate new content such as text, images, or code. These models learn patterns from existing data and produce meaningful outputs.
What is RAG?
Retrieval-Augmented Generation (RAG) combines retrieval and generation. It first retrieves relevant data from a knowledge base and then passes that data to a language model, which generates an answer grounded in it.
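The retrieve-then-generate flow can be sketched in a few lines. This is only a toy illustration: it ranks documents by word-count cosine similarity instead of real embeddings, and it stops at building the prompt rather than calling a language model. The function names (`retrieve`, `build_prompt`) and the two-sentence knowledge base are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (toy stand-in for vector search)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to a language model (no model is called here)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical two-entry knowledge base, for illustration only.
knowledge_base = [
    "Python is a programming language",
    "RAG combines retrieval and generation",
]
top = retrieve("What is Python?", knowledge_base)
prompt = build_prompt("What is Python?", top)
print(prompt)
```

In a real system the word-count vectors would be replaced by embeddings from a model, and the prompt would be sent to a language model for the generation step.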
Why is Data Preparation Important?
Data preparation ensures that the input data is clean, structured, and meaningful for embedding and retrieval. Poorly prepared data produces poor embeddings, which leads to irrelevant retrieval and weak answers.
Steps in Data Preparation
1. Collect raw data (CSV, PDFs, APIs)
2. Clean data (remove noise, duplicates)
3. Convert to structured format (JSON)
4. Chunk data into smaller pieces
5. Add metadata (source, category)
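The steps above can be sketched as one small pipeline. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names (`clean`, `deduplicate`, `chunk_words`, `prepare`), the `faq.csv` source label, and the tiny chunk size of 4 words are all hypothetical and chosen only to keep the output readable.

```python
import json
import re


def clean(text):
    """Step 2a: strip HTML tags and collapse repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(texts):
    """Step 2b: drop exact duplicates (case-insensitive), keeping the first occurrence."""
    seen = set()
    unique = []
    for t in texts:
        key = t.lower()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def chunk_words(text, size=300):
    """Step 4: split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def prepare(raw_texts, source, category, size=300):
    """Steps 2-5: clean, deduplicate, chunk, and attach metadata to each chunk."""
    records = []
    for doc in deduplicate([clean(t) for t in raw_texts]):
        for i, chunk in enumerate(chunk_words(doc, size)):
            records.append({
                "text": chunk,
                "metadata": {"source": source, "category": category, "chunk_index": i},
            })
    return records


# Step 1 is assumed done: `raw` stands in for text already collected from CSVs, PDFs, or APIs.
raw = [
    "<p>Python is a programming language</p>",
    "Python is a  programming language",  # duplicate once cleaned
    "RAG combines retrieval and generation",
]
records = prepare(raw, source="faq.csv", category="basics", size=4)
print(json.dumps(records, indent=2))  # Step 3: structured JSON output
```

Cleaning runs before deduplication here so that two copies of the same text that differ only in markup or spacing are recognized as duplicates.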
Example: CSV to JSON
{ "question": "What is Python?", "answer": "Python is a programming language" }
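One way to produce a record like this from a CSV file is with Python's standard `csv` and `json` modules. The sketch below assumes a well-formed CSV with a header row naming the `question` and `answer` columns; the inline `csv_text` string stands in for a real file, and `csv_to_records` is a hypothetical helper name.

```python
import csv
import io
import json

# Inline sample standing in for a CSV file on disk (assumed layout: header row, then data rows).
csv_text = "question,answer\nWhat is Python?,Python is a programming language\n"


def csv_to_records(text):
    """Parse CSV text into a list of dicts, one per row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(text))
    # Assumes every row has the same number of fields as the header.
    return [{key: (value or "").strip() for key, value in row.items()} for row in reader]


records = csv_to_records(csv_text)
print(json.dumps(records[0]))
```

For a file on disk, the same `csv.DictReader` call can be given a file object opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` wrapper.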
Final Output
Prepared data is now ready for embedding, vector storage, and search in a RAG system.
🚀 RAG Data Preparation Techniques
| # | Technique | What it Means | Why it is Important | Example |
|---|---|---|---|---|
| 1 | Text Cleaning | Remove HTML, symbols, noise | Improves embedding quality | <p>Hello</p> → Hello |
| 2 | Lowercasing | Convert text to lowercase | Avoid duplicate embeddings | Python = python |
| 3 | Remove Stop Words (Optional) | Remove common words | Reduces noise | is a language → language |
| 4 | Remove Special Characters | Clean unwanted symbols | Cleaner meaning | @@Python!! → Python |
| 5 | Deduplication | Remove duplicate content | Avoid repeated results | Same paragraph twice |
| 6 | Sentence Boundary Detection | Split into sentences | Keeps meaning intact | NLP sentence splitting |
| 7 | Chunking (Fixed Size) | Equal size chunks | Core of RAG | 300 words per chunk |
| 8 | Chunking (Semantic) 🔥 | Split by meaning | Better context | Split by topic |
| 9 | Overlapping Chunks | Add overlap | Avoid context loss | Words 1–100, then 80–180 |
| 10 | Context Enrichment | Add missing context | Self-contained chunks | Python is… |
| 11 | Metadata Tagging | Add extra info | Helps filtering | {"topic":"AI"} |
| 12 | Document Structuring | Preserve sections | Improves understanding | Title + paragraph |
| 13 | Token Size Optimization | Limit tokens | Prevent errors | 200–500 tokens |
| 14 | Noise Removal | Remove junk | Improves relevance | Remove ads |
| 15 | Language Normalization | Standardize text | Better embeddings | Fix grammar |
| 16 | Keyword Extraction | Identify key terms | Hybrid search | Python, list |
| 17 | Named Entity Recognition | Extract entities | Better accuracy | AWS, Lambda |
| 18 | Embedding Generation | Text → vector | Core of search | OpenAI embeddings |
| 19 | Consistent Embedding Model | Same model | Prevent mismatch | Same model everywhere |
| 20 | Data Filtering | Remove irrelevant | Improve precision | Drop junk docs |
| 21 | Hierarchical Chunking | Parent-child chunks | Advanced retrieval | Section → Subsection |
| 22 | Sliding Window Chunking | Moving chunks | Maintain continuity | Shift by 50 words |
| 23 | Semantic Compression | Reduce text | Save cost | Summarized chunk |
| 24 | Query-Aware Chunking | Based on queries | Better relevance | FAQ chunks |
| 25 | Data Augmentation | Add examples | Better coverage | Q&A pairs |
| 26 | Title Injection | Add titles | Better context | Python Basics |
| 27 | Source Attribution | Store source | Trust & UI | pdf1 |
| 28 | Format Standardization | Consistent schema | Easier processing | JSON format |
| 29 | Multilingual Handling | Handle languages | Better retrieval | EN vs Telugu |
| 30 | Index Optimization | Optimize index | Faster search | Mappings |
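A few of the techniques in the table (overlapping / sliding window chunking, title injection, metadata tagging) can be combined in a short sketch. The window size of 100 words and step of 80 words are assumptions chosen so that consecutive chunks share 20 words; the function names and the placeholder words are hypothetical, while the title, topic, and source values reuse the table's own examples.

```python
def sliding_window_chunks(words, size, step):
    """Return windows of `size` words, advancing by `step` words (overlap = size - step)."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def inject_title(title, chunk_text):
    """Title injection: prefix each chunk with its document title for extra context."""
    return f"{title}\n{chunk_text}"


# 180 placeholder words (w1 ... w180) standing in for a real document.
words = [f"w{i}" for i in range(1, 181)]
chunks = sliding_window_chunks(words, size=100, step=80)

# Attach a title and metadata to every chunk.
records = [
    {
        "text": inject_title("Python Basics", " ".join(chunk)),
        "metadata": {"topic": "AI", "source": "pdf1", "chunk_index": i},
    }
    for i, chunk in enumerate(chunks)
]

for chunk in chunks:
    print(chunk[0], "to", chunk[-1], f"({len(chunk)} words)")
```

A smaller step gives more overlap and better continuity between chunks, at the cost of storing and embedding more text.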