Databricks Medallion Architecture
Medallion Architecture is a best-practice data design pattern used in Databricks. It organizes data into three logical layers (Bronze, Silver, and Gold), and data quality improves progressively as data moves from one layer to the next. It helps teams build reliable, scalable, and maintainable data pipelines.
1. Bronze Layer (Raw Data)
The Bronze layer stores raw data exactly as it is received from source systems. No transformations are applied except minimal metadata additions.
- Stores raw ingestion data
- Supports streaming & batch ingestion
- Acts as the single source of truth for raw data
- Enables replay capability
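# Bronze Layer - Streaming Ingestion Example (Auto Loader)
from pyspark.sql.functions import *
# Read raw CSV files incrementally as they arrive.
# Auto Loader needs a schema location when it infers the schema;
# the path used here is an assumed example.
raw_df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze/schema") \
    .load("/mnt/raw/transactions")
# Append the raw records to a Delta table; the checkpoint tracks progress.
# toTable() starts the streaming query (writeStream has no table() method).
raw_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/bronze") \
    .toTable("bronze_transactions")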
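The "minimal metadata additions" mentioned for the Bronze layer can be illustrated without a Spark cluster. The following is a plain-Python sketch, not Databricks code: the function name `to_bronze_record`, the field names `_source_file` and `_ingested_at`, and the sample file name are hypothetical, chosen only for illustration. It shows the idea that the raw payload is kept untouched while ingestion metadata is attached alongside it.

```python
from datetime import datetime, timezone


def to_bronze_record(raw_line, source_file, ingested_at=None):
    """Wrap one raw input line with ingestion metadata, leaving the payload unchanged."""
    if ingested_at is None:
        ingested_at = datetime.now(timezone.utc)
    return {
        "raw": raw_line,  # payload exactly as received, no parsing or cleaning
        "_source_file": source_file,  # hypothetical metadata field
        "_ingested_at": ingested_at.isoformat(),  # hypothetical metadata field
    }


# A fixed timestamp keeps the example deterministic.
fixed_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
record = to_bronze_record("T1,ATM,250.00", "/mnt/raw/transactions/a.csv", fixed_time)
print(record)
```

Because the payload is stored verbatim, downstream layers can be rebuilt from Bronze at any time, which is what makes replay possible.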
2. Silver Layer (Cleaned & Enriched Data)
The Silver layer cleans, filters, validates, and enriches the Bronze data. Invalid records are removed, null values are handled, and the schema is enforced.
- Data validation
- Schema enforcement
- Deduplication
- Data enrichment
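# Silver Layer - Cleaning Example
from pyspark.sql.functions import col, current_timestamp
bronze_df = spark.read.table("bronze_transactions")
# Keep positive amounts only (rows with a null amount are dropped as well),
# remove repeated transaction IDs, and stamp each row with the processing time.
silver_df = bronze_df \
    .filter(col("amount") > 0) \
    .dropDuplicates(["transaction_id"]) \
    .withColumn("processed_date", current_timestamp())
# Overwrite the Silver table with the cleaned result.
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_transactions")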
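The Silver responsibilities listed above (validation, schema enforcement, null handling, deduplication) can also be sketched in plain Python to make the row-level logic explicit. This is an illustrative sketch under assumptions, not the course's implementation: the function name `clean_transactions`, the `channel` column, and the `"UNKNOWN"` default are hypothetical.

```python
def clean_transactions(rows):
    """Return validated, typed, de-duplicated rows from raw string records."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        txn_id = row.get("transaction_id")
        if not txn_id or txn_id in seen_ids:
            continue  # drop rows without an id and repeated ids
        try:
            amount = float(row.get("amount"))  # schema enforcement: amount must be numeric
        except (TypeError, ValueError):
            continue  # drop rows whose amount is missing or cannot be parsed
        if not amount > 0:
            continue  # validation: keep positive amounts only (also rejects NaN)
        cleaned.append({
            "transaction_id": txn_id,
            "amount": amount,
            "channel": row.get("channel") or "UNKNOWN",  # null handling (assumed default)
        })
        seen_ids.add(txn_id)
    return cleaned


raw_rows = [
    {"transaction_id": "T1", "amount": "250.00", "channel": "ATM"},
    {"transaction_id": "T1", "amount": "250.00", "channel": "ATM"},  # duplicate
    {"transaction_id": "T2", "amount": "-10", "channel": "UPI"},     # invalid amount
    {"transaction_id": "T3", "amount": "abc", "channel": "Card"},    # unparseable amount
    {"transaction_id": "T4", "amount": "99.5", "channel": None},     # missing channel
]
silver_rows = clean_transactions(raw_rows)
print(silver_rows)
```

Only T1 (once) and T4 survive in this sample: the duplicate, the negative amount, and the unparseable amount are dropped, and the missing channel is filled with the assumed default.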
3. Gold Layer (Business-Level Data)
The Gold layer provides aggregated, business-ready data for dashboards, reporting, ML models, and analytics.
- Business KPIs
- Aggregations
- Dimension & Fact tables
- Optimized for BI tools
-- Gold Layer - Aggregation Example
CREATE OR REPLACE TABLE gold_daily_sales AS
SELECT
    DATE(transaction_date) AS sale_date,
    SUM(amount) AS total_sales,
    COUNT(*) AS total_transactions
FROM silver_transactions
GROUP BY DATE(transaction_date);
Real-Time Banking Scenario
Consider a Banking Transaction Processing System:
- Bronze: All ATM, UPI, Card transactions stored raw
- Silver: Fraud checks, invalid transaction removal
- Gold: Daily revenue dashboard, fraud summary reports
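With streaming ingestion, dashboards can refresh in near real time. The actual latency depends on the stream's trigger interval and on how often the Silver and Gold tables are updated.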
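To make the Gold step of the banking scenario concrete, here is a small plain-Python sketch of a daily summary. The sample rows, the `is_suspicious` flag, and the function name `daily_summary` are hypothetical; in Databricks this would normally be a SQL or PySpark aggregation over the Silver table rather than a Python loop.

```python
from collections import defaultdict


def daily_summary(silver_rows):
    """Aggregate cleaned transactions into one summary row per day."""
    totals = defaultdict(lambda: {"total_amount": 0.0, "transactions": 0, "suspicious": 0})
    for row in silver_rows:
        day = totals[row["transaction_date"]]
        day["total_amount"] += row["amount"]
        day["transactions"] += 1
        day["suspicious"] += 1 if row["is_suspicious"] else 0
    # Sort by date so the summary reads chronologically.
    return [{"sale_date": d, **totals[d]} for d in sorted(totals)]


# Hypothetical cleaned (Silver) rows covering ATM, UPI, and Card channels.
silver_rows = [
    {"transaction_date": "2024-01-01", "channel": "ATM", "amount": 250.0, "is_suspicious": False},
    {"transaction_date": "2024-01-01", "channel": "UPI", "amount": 100.0, "is_suspicious": True},
    {"transaction_date": "2024-01-02", "channel": "Card", "amount": 75.5, "is_suspicious": False},
]
gold_rows = daily_summary(silver_rows)
print(gold_rows)
```

Each output row carries both the daily total and the count of flagged transactions, so one summary can feed a revenue dashboard and a fraud report.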
Use Cases of Medallion Architecture
- Real-time analytics systems
- Enterprise Data Warehousing
- Machine Learning pipelines
- Fraud detection systems
- Customer 360 platforms
Across these use cases, the pattern improves scalability, data governance, and reliability, and it simplifies debugging because each layer can be inspected on its own.