Grocery Simulation Databricks
Featured Project
Timeline: May 2026 - May 2026
Project Type: Personal
Databricks Python
Project Description
Grocery Store Databricks is a data engineering portfolio project that demonstrates a production-style Lakehouse pipeline built on Databricks. It processes 6,668 simulated days of grocery store transactional data through a three-layer Medallion Architecture. The dataset originates from a custom PostgreSQL OLAP simulation that previously ran on a Raspberry Pi. It was exported as a star schema of five dimension tables and one fact table at the line-item grain, totaling over 700 MB of raw data.

The pipeline is implemented across three Databricks notebooks that sequentially ingest, clean, and aggregate the data using PySpark and Delta Lake. Each layer is stored as managed tables in Unity Catalog under a dedicated catalog and schema hierarchy.

The Bronze layer ingests raw CSVs from a Unity Catalog Volume into Delta tables, using explicit schema definitions to enforce data integrity at the source. The Silver layer applies data quality filters, standardizes string formatting, derives calculated measures including profit margin percentage, and denormalizes dimension labels onto the fact table to produce a single enriched source of truth. The Gold layer produces seven business-level aggregate tables: daily revenue and profit trends, product category performance, rush hour versus non-rush hour sales patterns, loyalty versus walk-in customer behavior, top products by profit, payment method breakdown, and employee performance ranked within role.

The project demonstrates end-to-end data pipeline development with PySpark, Delta Lake table management and schema enforcement, and Medallion Architecture design principles. It also provides hands-on experience with modern Databricks features, including Unity Catalog and Volumes.
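The Bronze, Silver, and Gold steps described above can be illustrated with a small Spark-free sketch. It is not the project's notebook code, which uses PySpark and Delta Lake; it only mirrors the layer logic in plain Python so it runs anywhere. The column names, the sample rows, the quality rule (positive quantity and price), and the margin formula (profit divided by revenue) are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical line-item columns and their parsers; the project's real
# schema is not shown in the description.
SCHEMA = {
    "sale_date": date.fromisoformat,
    "product_category": str,
    "quantity": int,
    "unit_price": float,
    "unit_cost": float,
}


def bronze(raw_csv):
    """Parse raw CSV text with an explicit schema; a bad value raises ValueError."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [{col: parse(row[col]) for col, parse in SCHEMA.items()} for row in reader]


def silver(rows):
    """Filter invalid rows, standardize strings, and derive calculated measures."""
    cleaned = []
    for r in rows:
        # Assumed data quality rule: keep only positive quantities and prices.
        if r["quantity"] <= 0 or r["unit_price"] <= 0:
            continue
        revenue = r["quantity"] * r["unit_price"]
        profit = revenue - r["quantity"] * r["unit_cost"]
        cleaned.append({
            **r,
            "product_category": r["product_category"].strip().title(),
            "revenue": round(revenue, 2),
            "profit": round(profit, 2),
            # Assumed definition: profit as a percentage of revenue.
            "profit_margin_pct": round(100 * profit / revenue, 2),
        })
    return cleaned


def gold_daily(rows):
    """Aggregate revenue and profit per day, one of the Gold-style summaries."""
    totals = defaultdict(lambda: {"revenue": 0.0, "profit": 0.0})
    for r in rows:
        totals[r["sale_date"]]["revenue"] += r["revenue"]
        totals[r["sale_date"]]["profit"] += r["profit"]
    return {d: {k: round(v, 2) for k, v in t.items()} for d, t in sorted(totals.items())}


# Made-up sample data, including one row that the Silver filter should drop.
RAW = """sale_date,product_category,quantity,unit_price,unit_cost
2024-01-01, dairy ,2,3.00,2.00
2024-01-01,PRODUCE,1,4.00,1.00
2024-01-02,dairy,0,3.00,2.00
2024-01-02,bakery,3,2.00,1.50
"""

silver_rows = silver(bronze(RAW))
daily = gold_daily(silver_rows)
```

In a Databricks notebook, each function would roughly correspond to a DataFrame transformation written to a Delta table, with the explicit schema passed to the CSV reader rather than applied row by row.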