The Medallion Architecture is a data design pattern, widely used on Databricks, that organizes data into successive layers (Bronze, Silver, and Gold) so that data quality improves at each step. It helps teams build reliable, scalable, and maintainable data pipelines.
The Bronze layer stores raw data exactly as it is received from source systems. No transformations are applied beyond adding minimal metadata, such as an ingestion timestamp.
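# Bronze Layer - Streaming Ingestion Example
from pyspark.sql.functions import current_timestamp

raw_df = (
    spark.readStream
    .format("cloudFiles")  # Auto Loader: incrementally picks up new files
    .option("cloudFiles.format", "csv")
    # Schema inference needs a schema location; this path is an assumed example.
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze/_schema")
    .load("/mnt/raw/transactions")
    .withColumn("ingested_at", current_timestamp())  # minimal metadata (assumed column name)
)

# Append raw records to a Delta table; the checkpoint tracks ingestion progress.
(
    raw_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze")
    .toTable("bronze_transactions")
)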
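To make "minimal metadata additions" concrete, here is a small Spark-free sketch in plain Python. The function name `to_bronze`, the field names `raw_value`, `_source_file`, and `_ingested_at`, and the sample file name are all illustrative assumptions, not part of any Databricks API. The only point is that the payload is kept verbatim and the metadata sits alongside it.

```python
from datetime import datetime, timezone


def to_bronze(raw_line: str, source_file: str) -> dict:
    """Wrap one raw input line with ingestion metadata, leaving the payload untouched."""
    return {
        "raw_value": raw_line,                        # payload stored exactly as received
        "_source_file": source_file,                  # hypothetical metadata field
        "_ingested_at": datetime.now(timezone.utc),   # hypothetical metadata field
    }


# Hypothetical sample line and file name, for illustration only.
record = to_bronze("T1001,250.00,2024-01-15", "/mnt/raw/transactions/part-0001.csv")
print(record["raw_value"])
```

Because the raw value is never parsed or altered at this stage, a later fix to the cleaning rules can be replayed from Bronze without going back to the source system.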
The Silver layer cleans, filters, validates, and enriches Bronze data. Invalid records are removed, null values are handled, and the schema is enforced.
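# Silver Layer - Cleaning Example
from pyspark.sql.functions import col, current_timestamp

bronze_df = spark.read.table("bronze_transactions")

silver_df = (
    bronze_df
    .filter(col("amount") > 0)              # drops non-positive and null amounts
    .dropDuplicates(["transaction_id"])     # keeps one row per transaction_id
    .withColumn("processed_date", current_timestamp())  # records processing time
)

# Batch overwrite: the whole Silver table is rebuilt on each run.
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_transactions")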
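The Silver description also mentions null handling and schema enforcement. As a rough illustration of those rules, the following plain-Python sketch applies them to a handful of records without Spark. The helper names `clean_record` and `to_silver` and the sample data are hypothetical, and unlike Spark's `dropDuplicates`, which keeps an arbitrary row per key, this sketch keeps the first occurrence.

```python
import math
from typing import Iterable, Optional


def clean_record(rec: dict) -> Optional[dict]:
    """Return a typed copy of rec, or None if it fails validation."""
    txn_id = rec.get("transaction_id")
    if txn_id is None or str(txn_id).strip() == "":
        return None                      # null handling: the key is required
    try:
        amount = float(rec.get("amount"))  # schema enforcement: amount must be numeric
    except (TypeError, ValueError):
        return None
    if not math.isfinite(amount) or amount <= 0:
        return None                      # validation: only positive, finite amounts
    return {"transaction_id": str(txn_id).strip(), "amount": amount}


def to_silver(records: Iterable[dict]) -> list:
    """Validate records and keep the first occurrence of each transaction_id."""
    seen = set()
    out = []
    for rec in records:
        cleaned = clean_record(rec)
        if cleaned is None or cleaned["transaction_id"] in seen:
            continue
        seen.add(cleaned["transaction_id"])
        out.append(cleaned)
    return out


# Hypothetical sample input covering each rejection rule.
sample = [
    {"transaction_id": "T1", "amount": "100.50"},
    {"transaction_id": "T1", "amount": "100.50"},  # duplicate key
    {"transaction_id": "T2", "amount": "-5"},      # non-positive amount
    {"transaction_id": None, "amount": "20"},      # missing key
    {"transaction_id": "T3", "amount": "abc"},     # non-numeric amount
    {"transaction_id": "T4", "amount": 42},
]
silver = to_silver(sample)
print(silver)
```

Keeping the rules in small pure functions like this is one way to unit-test them before porting the same logic to DataFrame operations.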
The Gold layer provides aggregated, business-ready data for dashboards, reporting, ML models, and analytics.
-- Gold Layer - Aggregation Example
CREATE OR REPLACE TABLE gold_daily_sales AS
SELECT
    DATE(transaction_date) AS sale_date,
    SUM(amount) AS total_sales,
    COUNT(*) AS total_transactions
FROM silver_transactions
GROUP BY DATE(transaction_date);
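For readers who want to check what the daily aggregation should produce, here is a minimal plain-Python sketch of the same grouping. The function name `daily_sales` and the sample rows are assumptions made for illustration, and timestamps are assumed to be ISO 8601 strings.

```python
from collections import defaultdict
from datetime import date, datetime


def daily_sales(rows):
    """Group rows by calendar date; sum the amounts and count the transactions."""
    totals = defaultdict(lambda: {"total_sales": 0.0, "total_transactions": 0})
    for row in rows:
        # Truncate the timestamp to a date, like DATE(transaction_date) in SQL.
        sale_date = datetime.fromisoformat(row["transaction_date"]).date()
        totals[sale_date]["total_sales"] += row["amount"]
        totals[sale_date]["total_transactions"] += 1
    return dict(totals)


# Hypothetical Silver rows, for illustration only.
rows = [
    {"transaction_date": "2024-01-15T09:30:00", "amount": 100.0},
    {"transaction_date": "2024-01-15T17:45:00", "amount": 50.0},
    {"transaction_date": "2024-01-16T08:00:00", "amount": 25.0},
]
gold = daily_sales(rows)
jan15 = gold[date(2024, 1, 15)]
print(jan15)
```

A small reference implementation like this can serve as an expected-value check when testing the Gold table against a known input.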
Consider a Banking Transaction Processing System:
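Streaming ingestion into the Bronze layer allows dashboards to refresh within seconds of new transactions arriving, provided the Silver and Gold tables are also updated incrementally rather than by periodic batch overwrite.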