🍔 Biteflation #2: The art of menu data wrangling

After collecting hundreds of restaurant menu files, I found myself stuck. Each JSON file contained a deeply nested data structure that would make my device hang when I tried to open it. The sheer volume of data – over 1GB of plain text files – made manual analysis impractical.

The breakthrough came when I discovered Microsoft Fabric, an all-in-one analytics platform that leverages familiar tools like Python, SQL, and Power Query. Coincidentally, Microsoft was running a Cloud Skills Challenge that offered a certification for completing a Fabric learning path and passing an exam. This presented the perfect opportunity to learn a new platform while reviving a shelved project.

To avoid getting overwhelmed, I started with a clear minimum viable product: create a dataset tracking price updates for each product variation across all restaurants in my sample. The essential data points would include:

  • Restaurant name
  • Product name
  • Product variation (if any)
  • Date of price record
  • Price amount

Since I had daily data for these restaurants, I decided to focus on capturing just the first day when each price changed – a key simplification that would make the analysis more manageable.

Technical deep dive: Taming complex JSON data

The JSON structure presented several challenges. Each menu was deeply nested, in a format similar to this:

{
  "data": {
    "menus": {
      "menu_categories": [
        {
          "name": "Snacks",  
          "items": {
            "products": [
              {
                "name": "Fries",
                "product_variations": [
                  {
                    "name": "Large",
                    "base_price": "...",
                    "topping_ids": ["..."]
                  }
                ]
              }
            ]
          }
        }
      ],
      "topping_classes": [
        {
          "name": "Fries Add-ons",
          "options": [
            {
              "name": "Cheese",
              "addon_price": "..."
            }
          ]
        }
      ]
    }
  }
}

For simplification, I focused on base prices at the product variation level, setting aside the complexity of topping options. This decision came with trade-offs – we might miss price changes hidden in menu restructuring, such as when restaurants move items between variation types or convert standalone items into topping options.

My approach followed an adapted medallion architecture, processing the data through three increasingly refined stages:

  • Bronze layer: raw JSON flattening and initial extraction
  • Silver layer: data cleaning and standardization
  • Gold layer: analysis-ready dataset

The medallion architecture, popularized by Databricks and adopted by Microsoft Fabric, is a data processing pattern that progressively refines raw data through three layers: bronze (raw), silver (cleaned), and gold (business-ready). This approach, inspired by traditional data warehousing principles, helps manage complexity while ensuring data quality and maintainability.

Bronze stage: Initial data extraction

I began by working with a small subset of restaurant-date combinations to validate my approach before scaling up. Using PySpark, I created a “bronze” dataset that simply flattened the nested JSON structure while preserving all original fields.

# df holds the raw JSON files loaded earlier; flatten_dataframe and
# renameColumnPrefix are custom helper functions (sketched below).
from pyspark.sql.functions import col, explode_outer

# Explode data.menus so that each menu becomes its own row
menus_df = df.select("filename", "data.url_key", "data.code", explode_outer("data.menus").alias("menus"))
# Flatten the first layer of menus_df manually, without the helper
menus_df = menus_df.select("filename", "url_key", col("code").alias("restaurant_code"), "menus.*")
display(menus_df)

# One row per menu category, then flatten the nested structs
categories_df = menus_df.select("filename", "url_key", "restaurant_code", explode_outer("menu_categories").alias("menu_categories"))
categories_df = flatten_dataframe(categories_df)
display(categories_df)

# One row per product, carrying along the filename and the parent category's ID
products_df = categories_df.select("filename", "menu_categories_id", "menu_categories_products")
products_df = products_df.select("filename", explode_outer("menu_categories_products").alias("menu_categories_products"), "menu_categories_id")
products_df = flatten_dataframe(products_df)
products_df = renameColumnPrefix(products_df, "menu_categories_products_")
display(products_df)

# One row per product variation, keeping all product-level columns
variations_df = products_df.select(explode_outer("product_variations").alias("variation_data"), "*").drop("product_variations")
variations_df = flatten_dataframe(variations_df)
display(variations_df)

# Persist the flattened result as the bronze table
variations_df.write.format("parquet").mode("overwrite").saveAsTable("variations")
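
The two helper functions above aren't built into PySpark. As a minimal sketch, implementations that would behave the way they're used here might look like this (the actual helpers in the repo may differ):

from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_dataframe(df):
    # Sketch implementation: repeatedly expand struct columns into
    # top-level columns, prefixing child fields with the parent name,
    # until no structs remain.
    while True:
        flat_cols, found_struct = [], False
        for field in df.schema.fields:
            if isinstance(field.dataType, StructType):
                found_struct = True
                flat_cols += [
                    col(field.name + "." + child.name).alias(field.name + "_" + child.name)
                    for child in field.dataType.fields
                ]
            else:
                flat_cols.append(col(field.name))
        if not found_struct:
            return df
        df = df.select(flat_cols)

def renameColumnPrefix(df, prefix):
    # Sketch implementation: strip a common prefix from column names,
    # e.g. "menu_categories_products_name" -> "name".
    for column in df.columns:
        if column.startswith(prefix):
            df = df.withColumnRenamed(column, column[len(prefix):])
    return df

Note that flatten_dataframe only expands structs; arrays still need an explicit explode_outer, which is why each nesting level is exploded first in the code above.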

Silver stage: Data refinement

The silver stage involved SQL transformations to:

  • Remove unnecessary columns
  • Filter out null/missing values
  • Standardize data formats

CREATE OR REPLACE TABLE SILVER_MENU_DATA.variations
USING DELTA
LOCATION 'Tables/variations'
AS
WITH extracted_data AS (
  SELECT
    *,
    -- Extract the last part of the filename (after the last '/')
    regexp_extract(filename, '/([^/]+)$', 1) AS short_filename
  FROM
    BRONZE_MENU_DATA.variations
)
SELECT
    short_filename,
    menu_categories_id,
    id,
    code,
    name,
    file_path,
    logo_path,
    variation_data_code,
    variation_data_id,
    variation_data_name,
    variation_data_price    
FROM
    extracted_data
WHERE
    menu_categories_id IS NOT NULL 
    AND id IS NOT NULL

Gold stage: Analysis-ready dataset

The final gold dataset was prepared using Power Query, where I:

  • Removed duplicate prices, keeping only the first date of each price change (see the sketch after this list)
  • Retained only the MVP columns defined above, along with a few identifiers
  • Created the final structure used for visualization
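
Power Query handled this deduplication in my workflow, but the same logic is easy to express in code. Below is a rough PySpark sketch, assuming the silver table is loaded as silver_df and that restaurant_name and record_date columns have already been derived from the filename (both hypothetical names):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag

# restaurant_name and record_date are hypothetical derived columns.
# A price "change" is any row whose price differs from the previous
# record for the same restaurant + product variation.
w = (Window
     .partitionBy("restaurant_name", "id", "variation_data_id")
     .orderBy("record_date"))

price_changes = (
    silver_df
    .withColumn("prev_price", lag("variation_data_price").over(w))
    # Keep the first observation (no previous price) and any row where
    # the price differs from the previous one.
    .filter(col("prev_price").isNull()
            | (col("variation_data_price") != col("prev_price")))
    .drop("prev_price")
)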

Data cleaning challenges

The most significant challenge was handling inconsistent item names across time. For example, a restaurant might list the same item as:

  • “Spaghetti (Solo)”
  • “Spaghetti Solo”
  • “Spaghetti (Ala Carte)”

To address this, I used the combination of item ID and restaurant name as the unique identifier, rather than relying on item names. While this approach handles direct name changes, it doesn’t capture more complex scenarios such as:

  • Deleted items reappearing with new IDs
  • Changes in item hierarchy (e.g., splitting a single item into variations)
  • Menu restructuring

A future enhancement could leverage AI to identify and reconcile these more complex naming and structural changes, enabling more accurate price trend analysis.
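
Even before reaching for AI, plain string similarity could flag likely renames for manual review. This wasn't part of my pipeline, but as a quick illustration using Python's standard difflib:

from difflib import SequenceMatcher

names = ["Spaghetti (Solo)", "Spaghetti Solo", "Spaghetti (Ala Carte)"]

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; higher suggests the same underlying item.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")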

Leveraging Microsoft Fabric

Several features of Microsoft Fabric proved particularly valuable:

  1. Cloud processing. The ability to run data transformations in the cloud meant I could process large datasets in the background without overwhelming my local machine.

  2. VS Code integration. Seamless integration with Visual Studio Code maintained my preferred development environment while accessing Fabric’s capabilities.

  3. Power Query. The low-code approach to data transformations made common operations accessible, especially useful for those newer to SQL or Python.

  4. End-to-end analytics. While I ultimately used Tableau for visualization (due to trial limitations), Fabric’s direct integration with Power BI showcases its potential as a complete analytics solution.

(Image: an alternative visualization built with the Power BI service. It was very convenient to create, as it was directly connected to the Fabric data warehouse.)

Though I didn’t get to explore it during my trial, Microsoft’s Copilot integration could potentially streamline future analysis workflows.

Early insights

Initial analysis reveals several interesting patterns:

  • Inflationary pressure. The general trend shows steady price increases across items and restaurants, reflecting broader economic pressures.

  • Strategic pricing. Core items like fried chicken show remarkable price stability, occasionally even decreasing. This suggests their use as loss leaders to drive customer traffic.

  • Competitive dynamics. Similar price movements across restaurants, particularly evident in sundae pricing, hint at active competitive monitoring and matching.

Next steps

With our foundation of clean, structured data in place, we’re ready to dive deeper into the key questions:

  • What drives the timing of price changes?
  • Which menu items see the most frequent updates?
  • Do restaurants respond to competitors’ price changes?

Stay tuned for Part 3, where we’ll explore these questions and more through detailed analysis.

You can view the Tableau Public visualization here. You can also browse the GitHub repo.

🍔 Biteflation #1: Where the project started

As a frequent consumer of food delivery, I noticed that my budget took a hit with every price hike, which seemed to happen more often as we transitioned from the pandemic to the new normal.

This led me to the question: how much do fast food prices increase over time? Are price hikes proportionate to headline inflation (or to overall food inflation), or do they tend to be much higher, compensating for the fact that restaurant prices are typically more “sticky” than groceries?

It turns out that there is no easy answer due to limited data availability, which obscures the true impact of the price hikes on consumers. On an aggregate level, the Philippine Statistics Authority reports on the price index for “food and beverage serving services,” yet no distinction is made between full-service and limited-service restaurants. Additionally, aggregated data may mask significant variations in pricing strategies among different restaurant types.

Restaurant-level data, at least for the most popular chains, would prove useful in gaining a more nuanced perspective. There are two issues, however:

  1. Historical restaurant menu data is generally unavailable, at least in a consistent format suitable for data analysis. Makes total sense — most restaurants don’t want to draw attention to the fact that they’re increasing prices. There isn’t a standardized menu layout or repository in the industry.
  2. If prices for each menu item were available, the problem may now be too much granularity. It would be challenging to create a composite price index per restaurant, or for the entire Philippine fast food industry in general, because the relative quantity demanded of each individual item would not be as apparent as the published prices (see the index formula below).
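
To make the aggregation problem concrete: a standard Laspeyres-type price index weights each item by its base-period quantity,

$$ P_t = \frac{\sum_i p_{i,t} \, q_{i,0}}{\sum_i p_{i,0} \, q_{i,0}} $$

where p_{i,t} is item i's price at time t and q_{i,0} is the quantity sold in the base period. Menu data provides the prices, but without the quantity weights there is no principled way to roll hundreds of item prices into a single index.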

Leveraging fast food aggregator data

The Biteflation project was born when I saw an opportunity to address the first problem using fast food aggregator data.

For restaurants to be listed in an aggregator’s delivery service, their menus need to be stored in a consistent data format. Unlike menus on a company’s website or social media pages, which are sometimes outdated, we can reasonably expect the delivery service to reflect the latest prices. This makes it possible to track menu item prices daily and note any changes.

However, getting data from a food delivery app would still be a painstaking manual process, especially with the hundreds of products and variations across restaurants (solo, small meal, medium meal, large meal, etc.). Interestingly, one food aggregator’s hidden web API provided a structured JSON response containing all the menu data.

Long story short, a structured version of the restaurant data can be obtained using existing Web technologies. Yes, the structure was very complicated as it was deeply nested and had a convoluted way of storing some data (particularly product variations and add-ons!), but all the menu data was there. Data transformation and processing could come later.

All that was needed was a mechanism to download the data and store it automatically on a daily basis. A simple Python script and GitHub Actions allowed for just that, as sketched below.
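
Conceptually, the script boils down to something like this; the endpoint and vendor codes are placeholders (the real ones live in the repo), and GitHub Actions simply runs the script on a daily cron schedule:

import datetime
import json
import pathlib
import requests

# Placeholder endpoint and vendor codes; the real values are in the repo.
API_URL = "https://api.example-aggregator.com/v5/vendors/{code}"
RESTAURANT_CODES = ["resto1", "resto2"]

today = datetime.date.today().isoformat()
for code in RESTAURANT_CODES:
    resp = requests.get(API_URL.format(code=code), timeout=30)
    resp.raise_for_status()
    # One date-stamped JSON file per restaurant per day
    out_path = pathlib.Path("data") / code / f"{today}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(resp.json(), indent=2))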

Soon enough, I had more than 500 days of *.json files for each restaurant. This ended up taking >1GB of storage just for plain text restaurant data. Certainly, the data download process could have been more efficient, with new *.json files stored only when there were actual changes in the menu contents (one way to do this is sketched below). Nevertheless, keeping the raw data intact also provides the most flexibility for data manipulation later on.
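
For instance, hashing each day's payload and skipping the write when nothing has changed would avoid storing duplicates (the function below is a hypothetical sketch, not part of the actual scraper):

import hashlib
import pathlib

def menu_changed(new_payload: str, last_file: pathlib.Path) -> bool:
    # Hypothetical helper: store a new snapshot only if the payload
    # differs from the last saved file.
    if not last_file.exists():
        return True
    new_hash = hashlib.sha256(new_payload.encode("utf-8")).hexdigest()
    old_hash = hashlib.sha256(last_file.read_bytes()).hexdigest()
    return new_hash != old_hash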

And two years ago, I didn’t really know how to process all the data… 😅

All I knew was this could potentially lead to more interesting insights into pricing strategy, such as:

  • when restaurants tend to implement price changes — is there a particular day (1st of the month) or cadence (every x months)?
  • what menu items often change in price and to what extent — do restaurants focus on core/add-on items for their price increases? do they change the prices of all their items simultaneously or take a staggered approach? are price changes in response to supply chain shocks related to direct raw materials?
  • how restaurants react to changes in competitor prices — do restaurants increase their prices in response to their competition? if so, in what timeframe and in what proportion?

This is the first part in a series of articles sharing my data science learning journey through the Biteflation project. Next time, I’ll delve into how I processed the data and generated the dashboard below.

You can view the Tableau Public visualization here. You can also browse the GitHub repo.