pandas is a free, open-source Python library for working with structured data: data that looks like a table with rows and columns, like an Excel spreadsheet or a database table. The name comes from "panel data", a term used in statistics and economics. Wes McKinney started building it in 2008, and it is now one of the most widely used data tools in Python.

💡

Real-world analogy for freshers:
Think of pandas like a super-powered Excel that you control with Python code. In Excel, you click buttons to sort, filter, and calculate. In pandas, you write a few lines of Python and the same work happens — but on millions of rows, in seconds, automatically. And unlike Excel, you can save those steps as a script and reuse them forever.

▶ This is a pandas DataFrame — a table in Python memory

	ID	Name	Department	City	Salary	Join Date
0	1	Alice	HR	Bangalore	60000	2023-01-15
1	2	Bob	IT	Hyderabad	75000	2022-07-23
2	3	Charlie	Finance	Mumbai	62000	2021-11-10
3	4	David	IT	Pune	82000	2023-03-05
4	5	Eve	Sales	Bangalore	55000	2022-01-20
5	6	Frank	Finance	Delhi	NULL	2021-06-18
6	7	Grace	HR	Hyderabad	65000	2023-02-28
7	8	Henry	Sales	Mumbai	58000	2022-09-12

📌 Key Terms: The rows are called records / observations. The columns are called features / fields / attributes. The left-side numbers (0,1,2…) are the index, which is how pandas labels each row. The NULL cell in Frank's Salary column is a missing value (pandas displays it as NaN). Missing values are one of the most common real-world data problems pandas helps you solve.
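To see these terms in code, here is a minimal sketch. It uses a made-up three-row slice of the table above (only three of the six columns), so the variable contents are illustrative, not the full dataset.

```python
import pandas as pd

# A small slice of the sample table; None marks Frank's missing salary
df = pd.DataFrame({
    "Name": ["Eve", "Frank", "Grace"],
    "Department": ["Sales", "Finance", "HR"],
    "Salary": [55000, None, 65000],
})

print(df.index.tolist())     # the index (row labels): [0, 1, 2]
print(df.columns.tolist())   # the columns (features / fields)
print(df["Salary"].isna())   # True only for the row with the missing value
```

When you print this DataFrame, Frank's salary shows up as NaN, which is how pandas displays a missing number.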

Terminal — Install pandas

pip install pandas

hello_pandas.py

# Step 1 — import the library (always the first line)
import pandas as pd       # "pd" is a universal shortcut everyone uses

# Step 2 — create your first DataFrame from a Python list of dicts
data = [
    {"Name": "Alice",   "Department": "HR",      "Salary": 60000},
    {"Name": "Bob",     "Department": "IT",      "Salary": 75000},
    {"Name": "Charlie", "Department": "Finance", "Salary": 62000},
]

df = pd.DataFrame(data)   # convert list → pandas table

print(df)   # action — shows the table

▶ Output

      Name Department  Salary
0    Alice         HR   60000
1      Bob         IT   75000
2  Charlie    Finance   62000

Section 02

Why should you learn pandas?

Before learning how to use pandas, you need to understand why it matters. Every company — from startups to MNCs — generates data every single day. Someone has to clean it, analyse it, and turn it into useful information. That person is you, and pandas is your most important tool.

REASON 01

📊 Data is Everywhere

Every app, website, and business creates data — orders, clicks, transactions, logs. pandas lets you make sense of all of it.

REASON 02

💼 High Demand Jobs

Data Engineer, Data Analyst, Data Scientist: all of these roles use pandas regularly. It appears in a large share of data job descriptions in India.

REASON 03

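⚡ Handles More Data than Excel

An Excel worksheet is capped at 1,048,576 rows and usually slows down well before that. pandas can process millions of rows in seconds, as long as the data fits in your computer's memory. One script replaces hours of manual Excel work.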

REASON 04

🔗 Gateway to Big Tech

pandas skills carry over directly to PySpark, Machine Learning (scikit-learn), and Data Visualisation (Matplotlib, Seaborn), which all work with the same kind of table-shaped data. Learn pandas first and everything else becomes easier.

REASON 05

🧹 Real Data is Messy

In real projects, a large share of the time (often quoted as around 70%) goes into cleaning data: fixing nulls, removing duplicates, correcting formats. pandas is built for exactly this.

REASON 06

🌐 Reads Any File

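CSV, Excel, JSON, SQL databases, Parquet: one line of pandas code reads each of these formats. Some formats need a small helper package installed first (for example openpyxl for Excel files, pyarrow for Parquet, and a database driver or SQLAlchemy for SQL).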
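As a small sketch of the "one line per format" idea, the example below reads the same two made-up records from CSV text and from JSON text. It uses io.StringIO so you can run it without any files on disk; in a real project you would pass a file path instead.

```python
import io
import pandas as pd

# The same two records, written once as CSV text and once as JSON text
csv_text = "Name,Salary\nAlice,60000\nBob,75000\n"
json_text = '[{"Name": "Alice", "Salary": 60000}, {"Name": "Bob", "Salary": 75000}]'

from_csv = pd.read_csv(io.StringIO(csv_text))     # one line to read CSV
from_json = pd.read_json(io.StringIO(json_text))  # one line to read JSON

print(from_csv)
print(from_json)

# Other readers follow the same pattern (file names here are placeholders):
# pd.read_excel("report.xlsx")      # needs openpyxl installed
# pd.read_parquet("sales.parquet")  # needs pyarrow or fastparquet installed
```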

🎯 Remember this: In your first job interview, if you say "I know pandas and I can clean, analyse and transform data with it" — that immediately makes you stand out from freshers who only know theory.

Who uses pandas in their daily work?

Data Engineers

Data Analysts

Data Scientists

BI Developers

ML Engineers

Database Developers

Backend Developers

Section 03

14 Things pandas can do — explained with code

These are the 14 capabilities introduced in today's class. For each one, you will see what it means, how to write it, and what output it gives.

📊

CONCEPT 01

Data Analysis — Explore & Understand Patterns

Data Analysis means looking at your data to discover what it is telling you. Before writing any code that changes data, you first explore it — how many rows? What columns? Any missing values? What is the average salary?

01_data_analysis.py

import pandas as pd

df = pd.read_csv("employees.csv")

print(df.shape)       # → (8, 6) — 8 rows, 6 columns
df.info()             # column names, types, nulls (prints by itself; print() would add a stray "None")
print(df.head(3))    # first 3 rows
print(df.describe()) # count, mean, min, max of numeric cols
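If you do not have an employees.csv file yet, the sketch below builds the same sample table from a text string, so you can try the exploration commands straight away. The CSV text is typed in from the table shown earlier; io.StringIO simply makes the string behave like a file.

```python
import io
import pandas as pd

# The sample employee table as CSV text (Frank's salary is the text NULL)
csv_text = """ID,Name,Department,City,Salary,Join Date
1,Alice,HR,Bangalore,60000,2023-01-15
2,Bob,IT,Hyderabad,75000,2022-07-23
3,Charlie,Finance,Mumbai,62000,2021-11-10
4,David,IT,Pune,82000,2023-03-05
5,Eve,Sales,Bangalore,55000,2022-01-20
6,Frank,Finance,Delhi,NULL,2021-06-18
7,Grace,HR,Hyderabad,65000,2023-02-28
8,Henry,Sales,Mumbai,58000,2022-09-12
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                    # (8, 6)
print(df["Salary"].isna().sum())   # 1, because read_csv treats the text NULL as missing
print(df["Salary"].mean())         # mean() skips the missing value automatically
```

Because of the one missing value, pandas stores the Salary column as decimal numbers (float), so salaries print as 60000.0 rather than 60000 until the gap is filled and the column is converted back.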

🔧

CONCEPT 02

Data Transformation — Create & Modify Columns

Transformation means reshaping or creating new data from existing data. This is how you add business logic — calculate tax, classify employees, extract year from a date, etc.

02_transformation.py

# Add a new "Tax" column
df["Tax"] = df["Salary"] * 0.10

# Add "Level" column using a condition
df["Level"] = df["Salary"].apply(lambda x: "Senior" if x > 70000 else "Junior")

# Convert Join Date from text to real date
df["Join Date"] = pd.to_datetime(df["Join Date"])
df["Join Year"] = df["Join Date"].dt.year

print(df[["Name", "Salary", "Tax", "Level", "Join Year"]])

▶ Output (first 4 rows)

      Name  Salary     Tax   Level  Join Year
0    Alice   60000  6000.0  Junior       2023
1      Bob   75000  7500.0  Senior       2022
2  Charlie   62000  6200.0  Junior       2021
3    David   82000  8200.0  Senior       2023
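The .apply(lambda ...) approach calls a Python function once per row. As an alternative sketch, numpy.where does the same classification on the whole column at once, which is usually faster on large tables. The three-row table here is made up for illustration, and it includes a missing salary to show what happens to it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Frank"],
    "Salary": [60000, 75000, None],          # Frank's salary is missing
    "Join Date": ["2023-01-15", "2022-07-23", "2021-06-18"],
})

# Column arithmetic: a missing salary gives a missing tax
df["Tax"] = df["Salary"] * 0.10

# Whole-column condition: "Senior" where True, "Junior" everywhere else
df["Level"] = np.where(df["Salary"] > 70000, "Senior", "Junior")

# Text to real dates, then pull out the year
df["Join Year"] = pd.to_datetime(df["Join Date"]).dt.year

print(df)
```

Notice that Frank ends up as "Junior": a missing value compared with 70000 counts as False. This is one reason to clean missing values before applying business rules.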

🧹

CONCEPT 03

Data Cleaning — Handle Nulls & Duplicates

In real-world data, problems are always present — missing values, duplicate entries, wrong formats. Data Cleaning fixes these issues before analysis so your results are accurate.

03_cleaning.py

# 1. Find missing values
print(df.isnull().sum())
# Salary    1  ← Frank has NULL salary

# 2. Fill NULL salary with dept average
dept_avg = df.groupby("Department")["Salary"].transform("mean")
df["Salary"] = df["Salary"].fillna(dept_avg)

# 3. Remove duplicate rows
df = df.drop_duplicates()

# 4. Fix inconsistent text (strip stray spaces; .str.title() would turn "HR" into "Hr")
df["Department"] = df["Department"].str.strip()

print("Nulls remaining:", df.isnull().sum().sum())
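Here is a small worked example of these cleaning steps that you can run on its own. The five rows are made up: they contain one missing salary, one exact duplicate row, and department names with stray spaces. Cleaning the text first matters, because " Finance" and "Finance" would otherwise be treated as two different departments.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Charlie", "Frank", "Alice", "Grace", "Grace"],
    "Department": ["Finance", " Finance", "HR", "HR ", "HR "],
    "Salary": [62000, None, 60000, 65000, 65000],
})

# Strip stray spaces so both Finance rows fall in the same group
df["Department"] = df["Department"].str.strip()

# Drop the repeated Grace row and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

# transform("mean") returns each row's own department average
dept_mean = df.groupby("Department")["Salary"].transform("mean")
df["Salary"] = df["Salary"].fillna(dept_mean)

print(df)
```

Frank's salary is filled with 62000, the average of the other Finance salaries, instead of the company-wide average.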

🔍

CONCEPT 04 & 05

Data Filtering & Aggregation

Filtering = picking rows that match a condition (like SQL WHERE). Aggregation = summarising many rows into one number — total, average, count, max.

04_05_filter_agg.py

# FILTERING — get IT department employees with salary > 70000
it_high = df[(df["Department"] == "IT") & (df["Salary"] > 70000)]
print(it_high[["Name", "Salary"]])

# AGGREGATION — average salary per department
dept_stats = df.groupby("Department")["Salary"].agg(
    avg_salary="mean",
    max_salary="max",
    headcount="count",
).reset_index()
print(dept_stats)

▶ Output — dept_stats

  Department  avg_salary  max_salary  headcount
0    Finance       62000       62000          2
1         HR       62500       65000          2
2         IT       78500       82000          2
3      Sales       56500       58000          2
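To make the filtering step less mysterious, this sketch separates the True/False mask from the row selection, and also shows the same filter written with query(), which reads more like SQL. The four-row table is a made-up subset of the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "David", "Grace"],
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [60000, 75000, 82000, 65000],
})

# Each comparison gives a True/False column; & combines them row by row.
# The brackets around each comparison are required.
mask = (df["Department"] == "IT") & (df["Salary"] > 70000)
it_high = df[mask]

# The same filter with query()
it_high_q = df.query("Department == 'IT' and Salary > 70000")

# Aggregation: one summary row per department
dept_stats = df.groupby("Department")["Salary"].agg(avg_salary="mean", headcount="count")

print(it_high)
print(dept_stats)
```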

🗂️

CONCEPT 06 & 08

Grouping & Sorting

Grouping = split data into categories and summarise each group (like SQL GROUP BY). Sorting = ordering rows so the highest, lowest, or alphabetical item appears first.

06_08_group_sort.py

# GROUPING — how many employees per city?
city_count = df.groupby("City")["Name"].count()
print(city_count)

# SORTING — top 3 highest paid employees
top3 = df.sort_values("Salary", ascending=False).head(3)
print(top3[["Name", "Department", "Salary"]])

▶ Output — top 3 highest paid

    Name Department  Salary
3  David         IT   82000
1    Bob         IT   75000
6  Grace         HR   65000
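pandas also has shortcuts for these two very common jobs. The sketch below, on a made-up five-row table, uses value_counts() for "how many per category", nlargest() for "top N", and sort_values() with two columns for a multi-level sort.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "David", "Eve", "Grace"],
    "City": ["Bangalore", "Hyderabad", "Pune", "Bangalore", "Hyderabad"],
    "Salary": [60000, 75000, 82000, 55000, 65000],
})

# Count employees per city (shortcut for groupby + count)
city_count = df["City"].value_counts()

# Top 2 highest paid (shortcut for sort_values + head)
top2 = df.nlargest(2, "Salary")

# Sort by City A to Z, and inside each city by Salary high to low
by_city_then_pay = df.sort_values(["City", "Salary"], ascending=[True, False])

print(city_count)
print(top2)
print(by_city_then_pay)
```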

📁

CONCEPT 11 · 12 · 13 · 14

File Ops · ETL · Time Series · ML Data Prep

These are the professional-level applications of pandas you will use in real jobs:

📁 File Operations — read CSV / Excel / JSON / Parquet / SQL in one line, write results back.

🔄 ETL Development — Extract data from a source, Transform it (clean, join, calculate), Load it to a destination. This is what Data Engineers do every day.

📅 Time Series — analyse trends over time, resample from daily to monthly, compute rolling averages.

🤖 ML Data Preparation: before training an AI/ML model, you usually prepare the features first, and pandas handles much of this work: encoding categories, scaling numbers, splitting train/test data.

11_14_advanced.py

import pandas as pd

# ── File Operations ──────────────────────────────────────────
df = pd.read_csv("raw_sales.csv")                      # Extract
df.to_excel("clean_sales.xlsx", index=False)           # Save as Excel (needs openpyxl installed)
df.to_parquet("sales.parquet")                         # Save as Parquet (needs pyarrow or fastparquet)

# ── ETL Pipeline example ──────────────────────────────────────
raw   = pd.read_csv("orders.csv")                      # Extract
clean = raw.dropna().drop_duplicates().copy()          # Transform (.copy() avoids SettingWithCopyWarning)
clean["Revenue"] = clean["Qty"] * clean["Price"]
clean.to_csv("output/clean_orders.csv", index=False)   # Load (the output/ folder must already exist)

# ── Time Series: monthly revenue trend ───────────────────────
clean["Date"] = pd.to_datetime(clean["Date"])
monthly = clean.resample("M", on="Date")["Revenue"].sum()   # pandas 2.2+ prefers "ME" (month end)

# ── ML Prep: encode categorical columns ──────────────────────
employees  = pd.read_csv("employees.csv")              # the employee table, which has Department and City
df_encoded = pd.get_dummies(employees, columns=["Department", "City"], dtype=int)
# One 0/1 column per category (Department_HR, Department_IT, …), ready for ML model input
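The time series and ML preparation ideas can be tried without any files. This sketch uses five made-up orders to compute monthly totals and a rolling average, then does a simple train/test split using only pandas. Grouping by a monthly period is used here instead of resample() as one possible approach; it gives the same monthly totals for this data.

```python
import pandas as pd

orders = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-25", "2024-03-10"]
    ),
    "Revenue": [100.0, 200.0, 150.0, 50.0, 300.0],
})

# Monthly totals: group every order by its calendar month
monthly = orders.groupby(orders["Date"].dt.to_period("M"))["Revenue"].sum()

# Rolling average over the last 2 orders; the first row has no full window, so it is NaN
orders["Rolling2"] = orders["Revenue"].rolling(window=2).mean()

# 80/20 train/test split; random_state makes the shuffle repeatable
train = orders.sample(frac=0.8, random_state=0)
test = orders.drop(train.index)

print(monthly)
print(orders)
print(len(train), len(test))
```

In real ML projects the split is often done with scikit-learn's train_test_split, but the idea is the same: keep some rows aside that the model never sees during training.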