
From Chaos to Clarity: How We Built a Data Warehouse

  • Writer: Jon Tanton Brown
  • Apr 13
  • 5 min read

Retail is a data jungle. Customer transactions, inventory logs, marketing campaigns—every day retailers generate mountains of information. But without a clear system to organise it, that data is just noise. At Pangaea Analytics, we turn chaos into clarity with a robust data warehouse that powers analytics for some of the UK’s biggest retailers.


In this post, I’ll walk you through how we build a data warehouse that transforms raw data into actionable insights, using a real-world process we’ve honed over years. Whether you’re a retail exec drowning in spreadsheets or a data lead curious about our tech, this is the blueprint we use to make data work.


The Problem: Data That Doesn’t Talk

Imagine a retailer with sales data in one system, web analytics in another, customer profiles in another, and inventory updates in yet another. Sound familiar? That’s the reality for many brands before they work with us. Without a unified system, reporting is slow, insights are patchy, and decisions feel like guesswork. Our mission at Pangaea Analytics is to fix that by building a data warehouse—a single source of truth that lets retailers slice, dice, and act on their data with confidence.


We’ve built data warehouses for household names, enabling everything from self-service reporting to laser-focused customer segmentation. Here’s how we do it, step by step, using tools like Python, Exasol, and MicroStrategy (now rebranded as Strategy), all wrapped in a process that’s reliable and transparent.


Step 1: Connecting to the Source with Python

Every data warehouse starts with raw data, and retail systems are rarely tidy. Point-of-sale platforms, CRM tools, and marketing databases all speak different languages. To bring them together, we use Python for its flexibility and power.


Our process begins by connecting to each data source—whether it’s an API, a database, a flat file on SFTP or S3, or even a legacy system. Using Python libraries like pandas and requests, we pull the data, usually in one of two ways:

  • Flat Files for Archiving: We extract data into flat files (like CSVs), which are then archived securely in an Azure File Share. This gives us a historical backup, perfect for audits or future analysis. It also means that, should the need arise, we can rebuild entire data warehouses from scratch.

  • Direct Data Frames: For faster pipelines, we load data straight into a pandas DataFrame in memory, ready for the next step.


This approach balances speed and reliability. For example, when working with a retailer’s transaction data, we might archive daily sales logs in Azure while streaming customer interactions directly into our pipeline. Python’s versatility lets us handle whatever mess the source systems throw our way.
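
To make that concrete, here’s a minimal sketch of the dual approach in Python. The API endpoint, share name, and credentials are illustrative placeholders, not a real client setup:

    import io

    import pandas as pd
    import requests
    from azure.storage.fileshare import ShareFileClient  # pip install azure-storage-file-share

    # Placeholder source and credentials -- stand-ins for a real system.
    API_URL = "https://api.example-retailer.com/v1/sales"
    AZURE_CONN_STR = "..."  # held in secure config, never in code

    def extract_sales(run_date: str) -> pd.DataFrame:
        """Pull one day's raw sales from the source API into a DataFrame."""
        response = requests.get(API_URL, params={"date": run_date}, timeout=60)
        response.raise_for_status()
        return pd.DataFrame(response.json()["sales"])

    def archive_to_azure(df: pd.DataFrame, run_date: str) -> None:
        """Archive the untouched extract as a CSV in an Azure File Share."""
        csv_bytes = df.to_csv(index=False).encode("utf-8")
        file_client = ShareFileClient.from_connection_string(
            conn_str=AZURE_CONN_STR,
            share_name="warehouse-archive",
            file_path=f"sales/sales_{run_date}.csv",
        )
        file_client.upload_file(io.BytesIO(csv_bytes))

    sales_df = extract_sales("2025-04-13")
    archive_to_azure(sales_df, "2025-04-13")  # route 1: flat-file backup
    # sales_df stays in memory for the staging load -- route 2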


Step 2: Staging the Data in Exasol

Once we’ve got the raw data, it’s time to stage it in our data warehouse. We use Exasol, a high-performance analytical database, for its speed and scalability—crucial when dealing with retail’s massive datasets. To load data into Exasol, we rely on the pyexasol Python library, which lets us push DataFrames or files directly into staging tables.
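
In outline, a staging load looks something like this; the DSN, credentials, and table names are illustrative, not our production setup:

    import pyexasol  # pip install pyexasol

    conn = pyexasol.connect(
        dsn="exasol.internal:8563",  # placeholder host
        user="etl_user",
        password="***",
        compression=True,
    )

    # Land the extract exactly as it arrived -- no transformations yet.
    conn.execute("TRUNCATE TABLE STAGING.STG_SALES")
    conn.import_from_pandas(sales_df, ("STAGING", "STG_SALES"))

    # Archived flat files can be bulk-loaded the same way:
    # conn.import_from_file("sales_2025-04-13.csv", ("STAGING", "STG_SALES"))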


These staging tables are like a holding area. They store raw data exactly as it comes, with no transformations yet. This ensures we have an untouched copy in case anything goes wrong downstream. For instance, if we’re pulling sales data, the staging layer captures every transaction, timestamp, and SKU before we start cleaning or aggregating. It’s a safety net that lets us move fast without risking data loss.


Step 3: Transforming to the ODS Layer (Layer 2)

Here’s where the magic starts. Our Operational Data Store (ODS), or Layer 2, is where we clean, structure, and historize the data. We call it the “history keeper” because it tracks changes over time—vital for retailers who need to see how customer behaviour or inventory shifts.


We run a series of SQL scripts, stored locally within our Python program, to transform the staging data into Layer 2. These scripts do the heavy lifting:

  • Cleaning: Removing duplicates, fixing nulls, standardizing formats (e.g., ensuring dates are consistent).

  • History of Change: Capturing changes, like when a customer updates their profile or a product’s price shifts.


Python orchestrates this by triggering the SQL scripts via pyexasol, ensuring everything runs smoothly. For example, when building a warehouse for a fashion retailer, Layer 2 might track every change in a customer’s purchase history, letting us analyse trends over months or years. This layer is the foundation of the warehouse, preserving the data alongside a full history of change.
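
To give a flavour of those scripts, here is a hedged sketch of a historization step. The table and column names are invented for the example; the pattern is to append new or changed rows with a validity timestamp rather than overwrite what came before:

    # Illustrative Layer 2 script: append new or changed customer rows with a
    # validity timestamp. A second statement (omitted here) would close the
    # superseded row's valid_to.
    ODS_CUSTOMER_SQL = """
        INSERT INTO ODS.CUSTOMER_HISTORY
            (customer_id, email, segment, valid_from, valid_to)
        SELECT s.customer_id, s.email, s.segment, CURRENT_TIMESTAMP, NULL
        FROM   STAGING.STG_CUSTOMER s
        LEFT JOIN ODS.CUSTOMER_HISTORY h
               ON h.customer_id = s.customer_id
              AND h.valid_to IS NULL
        WHERE  h.customer_id IS NULL        -- brand-new customer
           OR  h.email   <> s.email         -- or an attribute has changed
           OR  h.segment <> s.segment
    """

    conn.execute(ODS_CUSTOMER_SQL)  # Python triggers the script via pyexasol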


Step 4: Crafting the Presentation Layer (Layer 3)

Now we’re ready for the star of the show: Layer 3, the presentation layer. This is where data becomes insights. Using SQL, we transform Layer 2’s operational tables into a structure optimized for analytics, based on the Kimball methodology—a gold standard for data warehousing.


The Kimball approach organizes data into facts (measurable events, like sales or returns) and dimensions (descriptive attributes, like customers, products, or stores). For instance:

  • A fact table might record every sale, with columns for revenue, quantity, and timestamp.

  • Dimension tables add context, like customer demographics or store locations.


This setup makes it easy for tools like MicroStrategy to slice and dice the data. Retailers can answer questions like “Which stores had the highest sales last week?” or “How do loyalty members shop differently?” with a few clicks.
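
As a sketch with invented table and column names, the first of those questions becomes a simple join across the star schema:

    # Star-schema query over the presentation layer -- names are illustrative.
    # "Which stores had the highest sales last week?"
    TOP_STORES_SQL = """
        SELECT d.store_name,
               SUM(f.revenue) AS total_revenue
        FROM   PRES.FACT_SALES f
        JOIN   PRES.DIM_STORE  d ON d.store_key = f.store_key
        JOIN   PRES.DIM_DATE   t ON t.date_key  = f.date_key
        WHERE  t.calendar_date BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
        GROUP  BY d.store_name
        ORDER  BY total_revenue DESC
        LIMIT  10
    """

    top_stores = conn.export_to_pandas(TOP_STORES_SQL)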


It’s this presentation layer that powers mobile dashboards and self-service BI tools, letting teams explore data without needing a PhD in SQL.


Step 5: Keeping It All Under Control

A data warehouse isn’t just about data—it’s about trust. Retailers rely on us to deliver accurate, timely insights, so we’ve built safeguards to ensure our pipeline never skips a beat.

  • Event Logging: We maintain a local log of every step—when data is pulled, when scripts run, how long each process takes. This helps us spot bottlenecks (e.g., a slow API call) and optimize performance.

  • Control Tables: We record key metadata in an Exasol control table, such as the last successful load or the number of rows processed. This is critical for dependent processes: if another process needs to run after our warehouse updates, it can check the control table to confirm the load is complete. A minimal sketch of both safeguards follows this list.

  • The Batphone: If something goes wrong—like a failed load or a data mismatch—our system alerts us instantly. We’ve set up a VoIP number (affectionately called the Batphone) that calls our team, day or night, so we can jump in and fix issues before they impact the client.
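
Here is a stripped-down sketch of the first two safeguards, with an illustrative control-table layout:

    import logging
    import time

    logging.basicConfig(filename="pipeline.log", level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    def run_step(name, step):
        """Run one pipeline step, timing it in the event log."""
        start = time.time()
        logging.info("START %s", name)
        result = step()
        logging.info("END %s (%.1fs)", name, time.time() - start)
        return result

    # After a successful run, record metadata for dependent processes.
    conn.execute(
        """
        INSERT INTO CONTROL.LOAD_STATUS
            (process_name, load_date, rows_loaded, status)
        VALUES ('sales_warehouse', CURRENT_DATE, {rows!d}, 'SUCCESS')
        """,
        {"rows": len(sales_df)},
    )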


These controls mean retailers can trust the warehouse to be their single source of truth, whether they’re checking stock levels or segmenting customers with RFM models.


From Chaos to Clarity: The Impact

So, what happens when the warehouse is up and running? For our clients, it’s a game-changer. Data that was once scattered across systems is now unified, historical trends are at their fingertips, and analytics tools like MicroStrategy light up with insights. Teams can build reports in minutes, not days. Marketing can target customers with precision, using models like RFM built on our Single Customer View. And execs can make decisions based on facts, not hunches.
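
As one example of what that unlocks, an RFM-style segmentation can be computed straight off the presentation layer. This is a hedged sketch with illustrative names, not our production model:

    # Recency, Frequency, Monetary per customer -- names are illustrative.
    RFM_SQL = """
        SELECT c.customer_id,
               MAX(t.calendar_date)       AS last_purchase,  -- Recency
               COUNT(DISTINCT f.order_id) AS order_count,    -- Frequency
               SUM(f.revenue)             AS total_spend     -- Monetary
        FROM   PRES.FACT_SALES f
        JOIN   PRES.DIM_CUSTOMER c ON c.customer_key = f.customer_key
        JOIN   PRES.DIM_DATE     t ON t.date_key     = f.date_key
        GROUP  BY c.customer_id
    """

    rfm = conn.export_to_pandas(RFM_SQL)
    # Scores (quintiles per dimension) can then be assigned in pandas, e.g.:
    rfm["F_SCORE"] = pd.qcut(rfm["ORDER_COUNT"].rank(method="first"), 5, labels=False) + 1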


At Pangaea Analytics, we’ve seen this transformation first-hand. Our data warehouse isn’t just a technical solution—it’s a foundation for growth. It’s how retailers move from surviving to thriving in a competitive world.


Want to Unlock Your Data?

Building a data warehouse isn’t easy, but it’s worth it. If your data feels like chaos, we can help you find clarity. At Pangaea Analytics, we specialize in turning complex systems into powerful insights, using tools like Python, Exasol, and MicroStrategy to deliver results that stick.


Curious how a data warehouse could transform your business? Drop me a line at jon@pangaeaanalytics.co.uk or visit https://www.pangaeaanalytics.co.uk/.


Let’s make your data work as hard as you do.

