Your complete beginner's guide — explained from zero, in plain English.
pandas is a free, open-source Python library created specifically for working with structured data — data that looks like a table with rows and columns, just like an Excel spreadsheet or a database table. The name comes from "Panel Data" — a term used in statistics and economics. It was built by Wes McKinney in 2008 and is now the most popular data tool in Python.
| | ID | Name | Department | City | Salary | Join Date |
|---|---|---|---|---|---|---|
| 0 | 1 | Alice | HR | Bangalore | 60000 | 2023-01-15 |
| 1 | 2 | Bob | IT | Hyderabad | 75000 | 2022-07-23 |
| 2 | 3 | Charlie | Finance | Mumbai | 62000 | 2021-11-10 |
| 3 | 4 | David | IT | Pune | 82000 | 2023-03-05 |
| 4 | 5 | Eve | Sales | Bangalore | 55000 | 2022-01-20 |
| 5 | 6 | Frank | Finance | Delhi | NULL | 2021-06-18 |
| 6 | 7 | Grace | HR | Hyderabad | 65000 | 2023-02-28 |
| 7 | 8 | Henry | Sales | Mumbai | 58000 | 2022-09-12 |
    # Step 1: import the library (always the first line)
    import pandas as pd   # "pd" is a universal shortcut everyone uses

    # Step 2: create your first DataFrame from a Python list of dicts
    data = [
        {"Name": "Alice", "Department": "HR", "Salary": 60000},
        {"Name": "Bob", "Department": "IT", "Salary": 75000},
        {"Name": "Charlie", "Department": "Finance", "Salary": 62000},
    ]
    df = pd.DataFrame(data)   # convert list → pandas table
    print(df)                 # action: shows the table
Name Department Salary
0 Alice HR 60000
1 Bob IT 75000
2 Charlie Finance 62000
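Once you have a DataFrame, the next thing you usually do is pull out a column or a row. The sketch below is only an illustration: it builds the same three employees column by column (a dict of lists, another common way to create a DataFrame) and then selects one column and one row.

```python
import pandas as pd

# Same three employees, written column by column instead of row by row
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Department": ["HR", "IT", "Finance"],
    "Salary": [60000, 75000, 62000],
}
df = pd.DataFrame(data)

salaries = df["Salary"]    # one column is a pandas Series
first_row = df.iloc[0]     # one row, selected by position

print(df.shape)            # (number of rows, number of columns)
print(salaries.max())      # highest salary in the column
print(first_row["Name"])   # name stored in the first row
```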
Before learning how to use pandas, you need to understand why it matters. Every company — from startups to MNCs — generates data every single day. Someone has to clean it, analyse it, and turn it into useful information. That person is you, and pandas is your most important tool.
Every app, website, and business creates data — orders, clicks, transactions, logs. pandas lets you make sense of all of it.
Data Engineer, Data Analyst, Data Scientist — all of these roles use pandas daily. It is listed in almost every data job description in India.
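Excel caps each worksheet at 1,048,576 rows and slows down well before that. pandas can work through 10 million rows in seconds, provided the data fits in your computer's memory. One script replaces hours of manual Excel work.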
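pandas is the starting point for PySpark, Machine Learning (scikit-learn), and Data Visualisation (Matplotlib, Seaborn). These tools either accept pandas DataFrames directly or use a very similar table concept. Learn pandas first, and everything else becomes easier.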
In real projects, a large share of the time (often quoted as around 70%) goes into cleaning data: fixing nulls, removing duplicates, correcting formats. pandas is built for exactly this.
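CSV, Excel, JSON, SQL databases, Parquet: pandas reads each of these with a single function call (read_csv, read_excel, read_json, read_sql, read_parquet). Some formats need a small helper package installed next to pandas, for example openpyxl for Excel files or pyarrow for Parquet.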
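To try reading and writing without creating any files, here is a minimal sketch that uses in-memory text in place of real files. The sample data and the variable names (`csv_text`, `json_text`, `out`) are made up for this illustration; with real files you would pass a file path instead of `io.StringIO(...)`.

```python
import io
import pandas as pd

# Small in-memory "files" (hypothetical sample data)
csv_text = "Name,Department,Salary\nAlice,HR,60000\nBob,IT,75000\n"
json_text = '[{"Name": "Alice", "Salary": 60000}, {"Name": "Bob", "Salary": 75000}]'

df_csv = pd.read_csv(io.StringIO(csv_text))     # one call reads CSV
df_json = pd.read_json(io.StringIO(json_text))  # one call reads JSON

# Writing works the same way in reverse
out = io.StringIO()
df_csv.to_csv(out, index=False)   # index=False leaves out the row numbers

print(df_csv)
print(df_json)
print(out.getvalue())
```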
Who uses pandas in their daily work?
These are the 14 capabilities introduced in today's class. For each one, you will see what it means, how to write it, and what output it gives.
Data Analysis means looking at your data to discover what it is telling you. Before writing any code that changes data, you first explore it — how many rows? What columns? Any missing values? What is the average salary?
    import pandas as pd

    df = pd.read_csv("employees.csv")
    print(df.shape)        # → (8, 6): 8 rows, 6 columns
    df.info()              # column names, types, nulls (prints its own report, so no print() needed)
    print(df.head(3))      # first 3 rows
    print(df.describe())   # count, mean, min, max of numeric cols
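If you do not have an employees.csv file yet, the same exploration can be tried on the sample table from the start of this guide. This is only a sketch: the CSV text is typed in by hand and read from memory, and the variable names (`csv_text`, `missing`, `avg_salary`, `dept_counts`) are chosen for this example.

```python
import io
import pandas as pd

# The sample employee table, typed in as CSV text
csv_text = """ID,Name,Department,City,Salary,Join Date
1,Alice,HR,Bangalore,60000,2023-01-15
2,Bob,IT,Hyderabad,75000,2022-07-23
3,Charlie,Finance,Mumbai,62000,2021-11-10
4,David,IT,Pune,82000,2023-03-05
5,Eve,Sales,Bangalore,55000,2022-01-20
6,Frank,Finance,Delhi,NULL,2021-06-18
7,Grace,HR,Hyderabad,65000,2023-02-28
8,Henry,Sales,Mumbai,58000,2022-09-12
"""
df = pd.read_csv(io.StringIO(csv_text))   # read_csv treats the text "NULL" as a missing value

missing = df.isnull().sum()                      # missing values per column
avg_salary = df["Salary"].mean()                 # mean ignores missing values
dept_counts = df["Department"].value_counts()    # employees per department

print(df.shape)
print(missing)
print(round(avg_salary, 2))
print(dept_counts)
```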
Transformation means reshaping or creating new data from existing data. This is how you add business logic — calculate tax, classify employees, extract year from a date, etc.
    # Add a new "Tax" column
    df["Tax"] = df["Salary"] * 0.10

    # Add "Level" column using a condition
    df["Level"] = df["Salary"].apply(lambda x: "Senior" if x > 70000 else "Junior")

    # Convert Join Date from text to real date
    df["Join Date"] = pd.to_datetime(df["Join Date"])
    df["Join Year"] = df["Join Date"].dt.year

    print(df[["Name", "Salary", "Tax", "Level", "Join Year"]])
Name Salary Tax Level Join Year
0 Alice 60000 6000.0 Junior 2023
1 Bob 75000 7500.0 Senior 2022
2 Charlie 62000 6200.0 Junior 2021
3 David 82000 8200.0 Senior 2023
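The same kind of conditional column can also be written without `apply`, by letting NumPy work on the whole column at once. The sketch below is one possible alternative, not the only way: `np.where` picks between two labels, and `dt.month_name()` shows another piece of information you can pull from a date. The four-row table and the "Join Month" column are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Salary": [60000, 75000, 62000, 82000],
    "Join Date": ["2023-01-15", "2022-07-23", "2021-11-10", "2023-03-05"],
})

# Whole-column condition: "Senior" where Salary > 70000, otherwise "Junior"
df["Level"] = np.where(df["Salary"] > 70000, "Senior", "Junior")

# Convert the text dates, then extract the month name
df["Join Date"] = pd.to_datetime(df["Join Date"])
df["Join Month"] = df["Join Date"].dt.month_name()

print(df[["Name", "Salary", "Level", "Join Month"]])
```

On a table this small both styles are equally fast; the whole-column style tends to matter more as the row count grows.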
In real-world data, problems are always present — missing values, duplicate entries, wrong formats. Data Cleaning fixes these issues before analysis so your results are accurate.
    # 1. Find missing values
    print(df.isnull().sum())   # Salary 1  ← Frank has NULL salary

    # 2. Fill NULL salary with the average of that employee's department
    dept_avg = df.groupby("Department")["Salary"].transform("mean")
    df["Salary"] = df["Salary"].fillna(dept_avg)

    # 3. Remove duplicate rows
    df = df.drop_duplicates()

    # 4. Fix inconsistent text: strip stray spaces
    #    (.str.title() would turn "HR" into "Hr" and "IT" into "It")
    df["Department"] = df["Department"].str.strip()

    print("Nulls remaining:", df.isnull().sum().sum())
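Which value you fill a gap with changes your results, so it is worth comparing the options before choosing. The sketch below uses a small made-up table (one missing salary, one duplicated row, stray spaces in a department name) and compares the overall average with the department average for the missing salary.

```python
import pandas as pd

# Hypothetical messy data: a missing salary, a duplicated row, stray spaces
df = pd.DataFrame({
    "Name": ["Charlie", "Frank", "Alice", "Grace", "Grace"],
    "Department": ["Finance", " Finance ", "HR", "HR", "HR"],
    "Salary": [62000, None, 60000, 65000, 65000],
})

# Clean the text first so " Finance " and "Finance" land in the same group
df["Department"] = df["Department"].str.strip()

# Drop exact duplicate rows so they do not skew the averages
df = df.drop_duplicates().reset_index(drop=True)

overall_mean = df["Salary"].mean()                                # one number for everyone
dept_mean = df.groupby("Department")["Salary"].transform("mean")  # one number per row's department

df["Salary"] = df["Salary"].fillna(dept_mean)

print(round(overall_mean, 2))
print(df)
print("Nulls remaining:", df.isnull().sum().sum())
```

Here the department average gives Frank the Finance figure rather than a number pulled up by HR salaries, which is usually closer to what the business expects.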
Filtering = picking rows that match a condition (like SQL WHERE). Aggregation = summarising many rows into one number — total, average, count, max.
    # FILTERING: get IT department employees with salary > 70000
    it_high = df[(df["Department"] == "IT") & (df["Salary"] > 70000)]
    print(it_high[["Name", "Salary"]])

    # AGGREGATION: average salary per department
    dept_stats = df.groupby("Department")["Salary"].agg(
        avg_salary="mean",
        max_salary="max",
        headcount="count",
    ).reset_index()
    print(dept_stats)
Department avg_salary max_salary headcount
0 Finance 62000 62000 2
1 HR 62500 65000 2
2 IT 78500 82000 2
3 Sales 56500 58000 2
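Filters can also combine conditions with OR, and aggregation does not always need groups. The sketch below, on a made-up seven-row table, shows an OR filter written two ways (`|` and `isin`) and a summary of the whole Salary column. The variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Grace", "Henry"],
    "City": ["Bangalore", "Hyderabad", "Mumbai", "Pune", "Bangalore", "Hyderabad", "Mumbai"],
    "Salary": [60000, 75000, 62000, 82000, 55000, 65000, 58000],
})

# OR filter: use | between conditions, and keep each condition in parentheses
two_cities = df[(df["City"] == "Bangalore") | (df["City"] == "Hyderabad")]

# Same filter with isin, which is easier to read when the list gets longer
two_cities_isin = df[df["City"].isin(["Bangalore", "Hyderabad"])]

# Aggregation over the whole column, no grouping
summary = df["Salary"].agg(["min", "mean", "max"])

print(two_cities[["Name", "City"]])
print(summary)
```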
Grouping = split data into categories and summarise each group (like SQL GROUP BY). Sorting = ordering rows so the highest, lowest, or alphabetical item appears first.
    # GROUPING: how many employees per city?
    city_count = df.groupby("City")["Name"].count()
    print(city_count)

    # SORTING: top 3 highest paid employees
    top3 = df.sort_values("Salary", ascending=False).head(3)
    print(top3[["Name", "Department", "Salary"]])
Name Department Salary
3 David IT 82000
1 Bob IT 75000
6 Grace HR 65000
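Grouping and sorting are often used together, for example to find the highest-paid person in each department. The sketch below is one way to do it on a made-up seven-row table: sort by two columns, then keep the first row of each group. It also shows `nlargest` as a shortcut for "sort descending and take the top N".

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Grace", "Henry"],
    "Department": ["HR", "IT", "Finance", "IT", "Sales", "HR", "Sales"],
    "Salary": [60000, 75000, 62000, 82000, 55000, 65000, 58000],
})

# Sort by department (A to Z), then by salary (highest first) inside each department
ranked = df.sort_values(["Department", "Salary"], ascending=[True, False])

# Keep the first row of every department, which is now the top earner
top_per_dept = ranked.groupby("Department").head(1)

# Shortcut for the overall top 3 by salary
top3 = df.nlargest(3, "Salary")

print(top_per_dept)
print(top3)
```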
These are the professional-level applications of pandas you will use in real jobs:
📁 File Operations — read CSV / Excel / JSON / Parquet / SQL in one line, write results back.
🔄 ETL Development — Extract data from a source, Transform it (clean, join, calculate), Load it to a destination. This is what Data Engineers do every day.
📅 Time Series — analyse trends over time, resample from daily to monthly, compute rolling averages.
🤖 ML Data Preparation — before training any AI/ML model, you must prepare features using pandas — encode categories, scale numbers, split train/test data.
    # ── File Operations ──
    df = pd.read_csv("raw_sales.csv")                      # Extract
    df.to_excel("clean_sales.xlsx", index=False)           # Save as Excel (needs openpyxl)
    df.to_parquet("sales.parquet")                         # Save as Parquet (AWS/cloud; needs pyarrow or fastparquet)

    # ── ETL Pipeline example ──
    raw = pd.read_csv("orders.csv")                        # Extract
    clean = raw.dropna().drop_duplicates()                 # Transform
    clean["Revenue"] = clean["Qty"] * clean["Price"]
    clean.to_csv("output/clean_orders.csv", index=False)   # Load (index=False leaves out the row numbers)

    # ── Time Series: monthly revenue trend ──
    clean["Date"] = pd.to_datetime(clean["Date"])
    monthly = clean.resample("M", on="Date")["Revenue"].sum()   # pandas 2.2+ prefers "ME" for month-end

    # ── ML Prep: encode categorical columns ──
    df_encoded = pd.get_dummies(df, columns=["Department", "City"])
    # Department_HR=1, Department_IT=1… ready for ML model input
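To see all four ideas run end to end without any files, here is a minimal sketch on a tiny made-up orders table. Everything in it is an assumption for illustration: the column names, the sample rows, and the helper functions `extract`, `transform`, and `load`. It groups by calendar month with `dt.to_period("M")` instead of `resample`, which is a different technique that gives the same monthly totals here, and it splits train/test with plain pandas sampling rather than an ML library.

```python
import io
import pandas as pd

# Hypothetical raw orders: one duplicate row and one row with a missing Qty
raw_csv = """Date,Product,Qty,Price
2024-01-05,Pen,10,5.0
2024-01-05,Pen,10,5.0
2024-01-20,Book,2,50.0
2024-02-03,Pen,4,5.0
2024-02-10,Book,,50.0
2024-02-25,Bag,1,300.0
"""

def extract(source):
    """Extract: read the raw data and parse the Date column as real dates."""
    return pd.read_csv(source, parse_dates=["Date"])

def transform(raw):
    """Transform: drop incomplete and duplicate rows, then add Revenue."""
    clean = raw.dropna().drop_duplicates().copy()
    clean["Revenue"] = clean["Qty"] * clean["Price"]
    return clean

def load(clean, target):
    """Load: write the cleaned table to the destination."""
    clean.to_csv(target, index=False)

raw = extract(io.StringIO(raw_csv))
clean = transform(raw)
out = io.StringIO()          # stands in for an output file
load(clean, out)

# Time series: total revenue per calendar month, plus a 2-order rolling average
monthly = clean.groupby(clean["Date"].dt.to_period("M"))["Revenue"].sum()
rolling = clean.set_index("Date")["Revenue"].rolling(2).mean()

# ML prep: one column per product, then a simple 75/25 train/test split
encoded = pd.get_dummies(clean, columns=["Product"])
train = encoded.sample(frac=0.75, random_state=0)
test = encoded.drop(train.index)

print(clean)
print(monthly)
print(encoded.columns.tolist())
```

Keeping extract, transform, and load as separate functions is a design choice, not a pandas requirement: it makes each step easy to test on its own and easy to swap, for example reading from a database instead of a CSV.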
This is not a skill you learn and forget. Here is exactly where and how pandas appears in real careers.
Real projects where pandas is used: