Data Engineering Interview Crash Workbook: Answer Guide

This page is an expanded companion to the original Data Engineering Interview Crash Workbook. It turns the prompt-style notes into interview-ready answers with examples, trade-offs, and a structure that reads cleanly on GitHub Pages.

How To Use This Guide

For each question, try to answer in this order:

  1. Define the concept in one sentence.
  2. Explain why it matters in production.
  3. Give one example.
  4. Call out one trade-off or failure mode.
  5. If relevant, show how you would implement or debug it.

Table of Contents

  1. SQL
  2. Coding
  3. Data Modeling
  4. Data Architecture
  5. Interview Delivery Tips

SQL

What Interviewers Want To Hear

In SQL interviews, the interviewer usually wants proof that you can translate a business question into a correct query, reason about grain, joins, and aggregation, and judge whether the query is efficient enough at scale.

The strongest answer is not just a query: it covers what the query does, why it is correct, why it is efficient enough, and what could go wrong.

1. What Is A Window Function?

Short answer: A window function calculates a value across a related set of rows while still returning one output row for each input row.

Detailed answer:
Window functions are useful when you need row-level detail plus analytical context. Unlike GROUP BY, they do not collapse rows. That makes them ideal for ranking, running totals, moving averages, session boundaries, and previous-versus-current comparisons.

Example: running total per employee

SELECT
  employee_id,
  sale_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY employee_id
    ORDER BY sale_date
  ) AS running_total
FROM sales;

How to explain it live: first name the partition (per employee), then the ordering (by sale date), then state that each row's value is the sum of all rows up to and including it within that employee's partition.

Common mistake: confusing window functions with grouped aggregations and accidentally returning the wrong grain.

2. GROUP BY vs Window Function

Short answer: GROUP BY reduces rows; window functions do not.

Detailed answer:
Use GROUP BY when the target output is one row per group, such as one row per customer or one row per day. Use a window function when the target output must keep row-level detail but also needs grouped context.

Example

-- one row per employee
SELECT employee_id, SUM(amount) AS total_sales
FROM sales
GROUP BY employee_id;

-- one row per sale with employee-level context
SELECT
  employee_id,
  sale_date,
  amount,
  SUM(amount) OVER (PARTITION BY employee_id) AS total_sales
FROM sales;

Interview line:
GROUP BY changes the grain. Window functions enrich the existing grain.

3. ROW_NUMBER vs RANK vs DENSE_RANK

Short answer: ROW_NUMBER is always unique; RANK leaves gaps on ties; DENSE_RANK does not.

Example with salaries 100, 100, 90:

Salary ROW_NUMBER RANK DENSE_RANK
100 1 1 1
100 2 1 1
90 3 3 2
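The tie behavior in the table can be reproduced in a few lines of plain Python; `ranking_demo` is an illustrative helper, not part of any workbook code:

```python
def ranking_demo(salaries):
    """Compute (value, ROW_NUMBER, RANK, DENSE_RANK) for salaries ordered descending."""
    ordered = sorted(salaries, reverse=True)
    rows = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a salary
    for i, s in enumerate(ordered, start=1):
        if s != prev:
            rank = i      # RANK jumps to the current row position after ties
            dense += 1    # DENSE_RANK always increments by exactly one
            prev = s
        rows.append((s, i, rank, dense))
    return rows

# ranking_demo([100, 100, 90]) reproduces the table above
```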

When to use each:

  - ROW_NUMBER when exactly one row per rank is required, such as deduplication or top-N
  - RANK for competition-style ranking where ties share a position and gaps are acceptable
  - DENSE_RANK for tiers or buckets where rank values must stay consecutive

4. Running Total

Short answer: Use SUM(...) OVER (ORDER BY ...).

SELECT
  order_date,
  revenue,
  SUM(revenue) OVER (ORDER BY order_date) AS running_revenue
FROM daily_revenue;

Important note: if the data contains multiple rows per day and you want a daily running total, aggregate to day first, then apply the window.
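As a sanity check, the running-total logic can be sketched with `itertools.accumulate`; the `daily_running_total` helper and its tuple shape are illustrative assumptions, with one row per date as the note above requires:

```python
from itertools import accumulate


def daily_running_total(rows):
    """rows: list of (order_date, revenue) with one row per date.
    Mirrors SUM(revenue) OVER (ORDER BY order_date)."""
    ordered = sorted(rows)  # ORDER BY order_date
    totals = accumulate(revenue for _, revenue in ordered)
    return [(d, r, t) for (d, r), t in zip(ordered, totals)]
```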

5. Moving Average

Short answer: Use a window frame.

SELECT
  order_date,
  AVG(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg_3
FROM daily_revenue;

Interview trap: ROWS BETWEEN 2 PRECEDING means previous two rows, not necessarily previous two calendar days.
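A quick Python sketch makes the row-based behavior concrete; `moving_avg_3_rows` is a hypothetical name, and note it averages row positions, not calendar dates:

```python
def moving_avg_3_rows(values):
    """AVG over ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: strictly row-based."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 2): i + 1]  # up to 3 rows ending here
        out.append(round(sum(window) / len(window), 2))
    return out
```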

6. LEAD vs LAG

Short answer: LAG looks backward; LEAD looks forward.

SELECT
  user_id,
  event_time,
  event_time
    - LAG(event_time) OVER (
      PARTITION BY user_id
      ORDER BY event_time
    ) AS gap_from_previous
FROM events;

Use cases:

  - gap and churn detection from previous event timestamps
  - period-over-period comparisons such as revenue deltas
  - sessionization boundaries based on inactivity

7. FIRST_VALUE and LAST_VALUE

Short answer: They return the first or last value in the window.

Important detail: LAST_VALUE often surprises people because the default frame may stop at the current row. To get the actual last value across the whole partition, define the frame explicitly.

SELECT
  customer_id,
  order_date,
  LAST_VALUE(order_date) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_order_date
FROM orders;

8. What Is PARTITION BY?

Short answer: It defines logical groups inside a window function.

Example: If you partition by customer_id, each customer’s rows are processed independently.

9. What Is ORDER BY In A Window?

Short answer: It defines sequence inside each partition.

Without it, ranking and running totals do not make business sense because there is no defined order.

10. What Is A Frame Clause?

Short answer: It controls which subset of ordered rows participates in the calculation.

Examples:

  - ROWS BETWEEN 2 PRECEDING AND CURRENT ROW for a three-row moving average
  - ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for a running total
  - ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING for a whole-partition value

11. Deduplication Using ROW_NUMBER

Short answer: Partition by business key, order by freshness or quality, keep rank 1.

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC, ingest_ts DESC
    ) AS rn
  FROM customer_records
)
SELECT *
FROM ranked
WHERE rn = 1;

What to say in interview: the ORDER BY inside the window is the deduplication policy. Ordering by updated_at DESC means the freshest record wins, and the ingest_ts DESC tiebreaker makes the result deterministic.
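The same keep-latest rule can be sketched in Python; the field names `customer_id`, `updated_at`, and `ingest_ts` mirror the query above, and the helper itself is illustrative:

```python
def keep_latest(records, key="customer_id", order=("updated_at", "ingest_ts")):
    """Keep one record per key; highest (updated_at, ingest_ts) wins,
    like ROW_NUMBER() ... ORDER BY updated_at DESC, ingest_ts DESC ... rn = 1."""
    best = {}
    for rec in records:
        k = rec[key]
        rank = tuple(rec[f] for f in order)
        if k not in best or rank > tuple(best[k][f] for f in order):
            best[k] = rec
    return list(best.values())
```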

12. Top N Per Group

WITH ranked AS (
  SELECT
    category,
    product_id,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY revenue DESC
    ) AS rn
  FROM product_revenue
)
SELECT *
FROM ranked
WHERE rn <= 3;

Why interviewers ask this: it checks whether you understand partitioning and post-window filtering.

13. Detect Gaps In Dates

SELECT
  customer_id,
  order_date,
  LAG(order_date) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS previous_order_date
FROM orders;

Then compute the difference. Large gaps often indicate churn, inactivity, or missing data.

14. Sessionization

Short answer: Split a user’s event stream into sessions using an inactivity threshold.

WITH tagged AS (
  SELECT
    *,
    CASE
      WHEN LAG(event_time) OVER (
        PARTITION BY user_id
        ORDER BY event_time
      ) IS NULL THEN 1
      WHEN event_time - LAG(event_time) OVER (
        PARTITION BY user_id
        ORDER BY event_time
      ) > INTERVAL '30' MINUTE THEN 1
      ELSE 0
    END AS new_session
  FROM events
),
sessionized AS (
  SELECT
    *,
    SUM(new_session) OVER (
      PARTITION BY user_id
      ORDER BY event_time
    ) AS session_id
  FROM tagged
)
SELECT *
FROM sessionized;

Production considerations:

  - the inactivity threshold is a product decision, not a technical constant
  - late or out-of-order events can split or merge sessions on reprocessing
  - time zones and clock skew affect gap calculations
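For a single user, the LAG-plus-cumulative-SUM pattern reduces to one loop; this sketch assumes epoch-second timestamps and a hypothetical `sessionize` helper:

```python
def sessionize(event_times, gap_minutes=30):
    """Assign session ids to one user's epoch-second timestamps,
    mirroring the LAG + cumulative SUM pattern above."""
    session = 0
    out = []
    prev = None
    for t in sorted(event_times):
        if prev is None or t - prev > gap_minutes * 60:
            session += 1  # first event, or gap exceeded: new session
        out.append((t, session))
        prev = t
    return out
```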

Join Questions You Should Answer Smoothly

INNER JOIN

Returns only matching rows from both tables.

LEFT JOIN

Returns every row from the left table and matched rows from the right table. If no match exists, right-side columns become NULL.

FULL OUTER JOIN

Returns matching rows plus unmatched rows from both sides.

SELF JOIN

Useful for hierarchical data such as employee-to-manager relationships.

SELECT
  e.employee_name,
  m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.employee_id;

Semi Join

Use EXISTS when you only care whether a match exists.

SELECT *
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

Anti Join

Use a LEFT JOIN ... WHERE right.id IS NULL or NOT EXISTS pattern to find missing relationships.

SELECT c.*
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

15. Why Joins Create Duplicate Rows

Short answer: because the join key is not unique on at least one side.

Example: one customer has 3 orders and 2 support tickets. Joining both detail tables by customer produces 6 rows.

Best explanation: row explosion is a cardinality problem, not a SQL syntax problem.
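A tiny Python sketch shows the cardinality math directly, with toy dictionaries and a hypothetical helper name:

```python
def join_on_customer(orders, tickets):
    """Naive inner join on customer_id; row count is the product of
    matching rows on each side, which is exactly the explosion."""
    return [(o, t)
            for o in orders
            for t in tickets
            if o["customer_id"] == t["customer_id"]]
```

With 3 orders and 2 tickets for the same customer, the join returns 3 × 2 = 6 rows.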

16. How To Fix Join Duplication

Common fixes:

  - aggregate each detail table to the target grain before joining
  - deduplicate the many-side with ROW_NUMBER before the join
  - verify key uniqueness on both sides before trusting the result
  - if duplication is legitimate, document the new grain explicitly

17. Join Skew

Short answer: one or a few keys own a disproportionate amount of data, causing uneven work distribution.

Example: if 40 percent of events have country = 'US', a distributed join may overload one worker.

Mitigations:

  - salt the hot keys to spread them across workers
  - broadcast the small side so the large side does not shuffle
  - handle extreme keys such as null or default values separately

18. Broadcast Join vs Shuffle Join

Broadcast join: replicate the small table to all workers.
Shuffle join: redistribute both tables by key.

Use broadcast when one side is small enough to replicate cheaply. It avoids moving the large table.

19. Null Behavior In Joins

Short answer: NULL = NULL is not true in standard SQL equality joins.

That means rows with null join keys usually do not match. In an interview, explicitly say you would decide whether null means “unknown” or “missing relationship.”

20. How To Debug A Bad Join

Use this sequence:

  1. Check row counts before the join.
  2. Check key uniqueness on each side.
  3. Check null rates.
  4. Sample exploding keys.
  5. Verify expected grain after the join.
  6. Compare business totals before and after.

Aggregations

COUNT(*) vs COUNT(column)

COUNT(*) counts rows; COUNT(column) counts rows where that column is not null. The difference matters whenever the column can contain nulls.

WHERE vs HAVING

WHERE filters rows before aggregation; HAVING filters groups after aggregation. Filtering early in WHERE is usually cheaper.

Conditional Aggregation

SELECT
  department,
  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS active_count,
  SUM(CASE WHEN status = 'inactive' THEN 1 ELSE 0 END) AS inactive_count
FROM employees
GROUP BY department;

DISTINCT Count

Use it for unique entities such as active users, unique customers, or unique devices.

Interview note: distinct counts are expensive at scale because the engine must track uniqueness.

Approximate Distinct Count

Use when exact precision is not necessary and scale is large. Examples include web analytics, telemetry, and observability systems.

Trade-off: much faster and smaller memory footprint, but not exact.

Median and Percentiles

These are more robust than averages when distributions are skewed.

Example: p95 latency tells you tail performance, while average latency can hide outliers.

Why AVG Can Mislead

If 99 requests take 10 ms and 1 request takes 5 seconds, the average looks acceptable while the user experience may still be bad.
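That exact scenario is easy to verify; this sketch uses the nearest-rank percentile definition, which is one common convention among several:

```python
import math


def avg_and_p95(latencies_ms):
    """Average hides the tail; p95 (nearest-rank) exposes it."""
    ordered = sorted(latencies_ms)
    avg = sum(ordered) / len(ordered)
    # nearest-rank p95: smallest value covering at least 95% of observations
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return round(avg, 1), p95

# 99 requests at 10 ms plus 1 at 5000 ms: average 59.9 ms, p95 still 10 ms
```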

Aggregation After Join Trap

Wrong pattern: join two detail tables and then sum.
Better pattern: aggregate each detail table to the required grain first, then join.

CTEs and Subqueries

What Is A CTE?

A CTE is a named temporary result set defined with WITH. It helps break a complex query into readable steps.

Why Use CTEs?

CTEs make multi-step logic readable, let you name intermediate results, and make each step testable on its own. Recursive CTEs also unlock hierarchy traversal that plain subqueries cannot express cleanly.

Recursive CTE

Useful for hierarchies such as org charts, folder trees, and dependency chains.

WITH RECURSIVE org AS (
  SELECT employee_id, manager_id, employee_name, 0 AS level
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  SELECT e.employee_id, e.manager_id, e.employee_name, o.level + 1
  FROM employees e
  JOIN org o
    ON e.manager_id = o.employee_id
)
SELECT *
FROM org;

Correlated Subquery

A correlated subquery depends on values from the outer query.

SELECT e.*
FROM employees e
WHERE salary > (
  SELECT AVG(salary)
  FROM employees
  WHERE department = e.department
);

EXISTS vs IN

EXISTS is often the clearer choice when the logic is row-by-row membership checking. It also avoids some null-related confusion seen with NOT IN.

Advanced SQL Topics You Should Be Ready For

Deduplication Strategies

Use a decision rule, not just DISTINCT.

Say the rule explicitly because deduplication is a business decision.

CDC Query Logic

CDC means capturing inserts, updates, and deletes from a source system.

Typical SQL behavior in a lakehouse:

  - inserts append new rows
  - updates are applied with MERGE on the business key
  - deletes become hard deletes or soft-delete flags, depending on policy

Merge / Upsert Pattern

MERGE INTO dim_customer t
USING stage_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.hash_value <> s.hash_value THEN
  UPDATE SET
    customer_name = s.customer_name,
    updated_at = s.updated_at,
    hash_value = s.hash_value
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, updated_at, hash_value)
  VALUES (s.customer_id, s.customer_name, s.updated_at, s.hash_value);

What the interviewer wants: proof that you understand idempotency, late-arriving data, and change detection.

Predicate Pushdown and Partition Pruning

Predicate pushdown: push filters to the source so less data is scanned.
Partition pruning: skip partitions that cannot contain relevant data.

Example: filtering WHERE event_date = DATE '2026-04-01' allows the engine to read one partition instead of a whole year.

Explain Plan Basics

When explaining a query plan, focus on:

  - scan type and how much data is read
  - join strategy and join order
  - where shuffles or sorts happen
  - estimated versus actual row counts

Row-Level Security and Column Masking

These are not just governance terms. They affect query design and user-visible results.

Interview example:
An analyst in EMEA can see only EMEA rows, while PII columns are masked for non-privileged roles.

SQL Injection Basics

If asked, keep the answer simple: never build SQL by concatenating user input. Use parameterized queries or prepared statements, run with least-privilege accounts, and validate input at the boundary.

SQL Practice Prompt Map

Use this table to answer many of the workbook prompts quickly.

Prompt Type Primary Tool What To Say
Latest record per key ROW_NUMBER() Partition by key, order by freshness, keep row 1
Top N per category ROW_NUMBER() or DENSE_RANK() Rank inside each group, filter rank
Missing relationships anti join Left join plus null filter or NOT EXISTS
Users with activity gaps LAG() Compare current event date with previous event date
Rolling metric window frame Explain row-based vs date-based frames
Above average in peer group correlated subquery or window avg Compare row to group average
Broken foreign keys anti join Find fact keys with no dimension match
Load validation counts plus sums Check row count, duplicates, nulls, totals

Coding

What Interviewers Want To Hear

For coding rounds, the target is usually:

  1. clarify the problem
  2. state brute force
  3. improve to an efficient approach
  4. code clearly
  5. test edge cases out loud

If your Python is weaker than your Java, keep the code simple. That is better than writing clever but brittle Python.

Arrays and Strings

Reverse A String

Brute force: build a new string from the back.
Optimized idea: Python slicing is already concise and efficient enough for interview scale.

def reverse_string(s: str) -> str:
    return s[::-1]

Complexity: O(n) time, O(n) space for the new string.

Palindrome Check

def is_palindrome(s: str) -> bool:
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    left, right = 0, len(cleaned) - 1
    while left < right:
        if cleaned[left] != cleaned[right]:
            return False
        left += 1
        right -= 1
    return True

Why this is good: it handles spaces, punctuation, and mixed case.

Two Sum

Brute force: try all pairs, O(n^2).
Optimized idea: hash map of value to index.

def two_sum(nums: list[int], target: int) -> list[int]:
    seen = {}
    for i, num in enumerate(nums):
        need = target - num
        if need in seen:
            return [seen[need], i]
        seen[num] = i
    return []

Complexity: O(n) time, O(n) space.

Edge cases:

  - no valid pair exists (return an empty list, per the signature above)
  - duplicate values that sum to the target
  - negative numbers and zero

Longest Substring Without Repeating Characters

Pattern: sliding window.

def length_of_longest_substring(s: str) -> int:
    last_seen = {}
    left = 0
    best = 0

    for right, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= left:
            left = last_seen[ch] + 1
        last_seen[ch] = right
        best = max(best, right - left + 1)

    return best

Complexity: O(n) time, O(min(n, charset)) space.

Maximum Subarray

Pattern: Kadane’s algorithm.

def max_subarray(nums: list[int]) -> int:
    current = best = nums[0]
    for num in nums[1:]:
        current = max(num, current + num)
        best = max(best, current)
    return best

Interview explanation: at each position, either start a new subarray or extend the existing one.

Product Of Array Except Self

Pattern: prefix and suffix products without division.

def product_except_self(nums: list[int]) -> list[int]:
    n = len(nums)
    result = [1] * n

    prefix = 1
    for i in range(n):
        result[i] = prefix
        prefix *= nums[i]

    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix
        suffix *= nums[i]

    return result

Complexity: O(n) time, O(1) extra space if output array is not counted.

Stack, Queue, Linked List

Valid Parentheses

def is_valid(s: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []

    for ch in s:
        if ch in "([{":
            stack.append(ch)
        else:
            if not stack or stack[-1] != pairs.get(ch):
                return False
            stack.pop()

    return not stack

Why this works: the latest opening bracket must match the next closing bracket, so stack order maps perfectly to the problem.

Reverse Linked List

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next


def reverse_list(head: ListNode | None) -> ListNode | None:
    prev = None
    curr = head

    while curr:
        nxt = curr.next
        curr.next = prev
        prev = curr
        curr = nxt

    return prev

Interview explanation: keep three pointers: previous, current, next.

Detect Cycle In Linked List

Use Floyd’s tortoise and hare algorithm.

def has_cycle(head: ListNode | None) -> bool:
    slow = fast = head
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast:
            return True
    return False

Complexity: O(n) time, O(1) space.

Trees and Graphs

BFS vs DFS

BFS: level by level using a queue.
DFS: go deep first using recursion or a stack.

Use BFS for shortest path in an unweighted graph or level-order traversal. Use DFS for traversal, cycle checks, or exhaustive exploration.

Binary Tree Traversal

def inorder(root):
    if not root:
        return []
    return inorder(root.left) + [root.val] + inorder(root.right)

Know the order:

  - inorder: left, root, right (sorted output for a BST)
  - preorder: root, left, right
  - postorder: left, right, root

Number Of Islands

Pattern: graph traversal on a grid.

def num_islands(grid: list[list[str]]) -> int:
    if not grid:
        return 0

    rows, cols = len(grid), len(grid[0])
    count = 0

    def dfs(r: int, c: int) -> None:
        if r < 0 or r >= rows or c < 0 or c >= cols:
            return
        if grid[r][c] != "1":
            return

        grid[r][c] = "0"
        dfs(r + 1, c)
        dfs(r - 1, c)
        dfs(r, c + 1)
        dfs(r, c - 1)

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "1":
                count += 1
                dfs(r, c)

    return count

Complexity: O(rows * cols) time.

Topological Sort / Course Schedule

If asked about course schedule, say: model the courses as a directed graph where each prerequisite is an edge. The schedule is feasible if and only if the graph has no cycle, which can be checked with Kahn's algorithm or DFS cycle detection.

That is already a strong answer even before coding.
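If you then need to code it, Kahn's algorithm is compact; the `can_finish` signature mirrors the common LeetCode shape, which is an assumption here:

```python
from collections import deque


def can_finish(num_courses: int, prereqs: list[tuple[int, int]]) -> bool:
    """Kahn's topological sort: all courses are finishable iff the graph is acyclic."""
    graph = [[] for _ in range(num_courses)]
    indegree = [0] * num_courses
    for course, pre in prereqs:
        graph[pre].append(course)
        indegree[course] += 1

    queue = deque(i for i in range(num_courses) if indegree[i] == 0)
    taken = 0
    while queue:
        node = queue.popleft()
        taken += 1
        for nxt in graph[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    return taken == num_courses  # a cycle leaves some courses never reaching indegree 0
```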

Fibonacci Recursion vs DP

Recursion only: elegant but repeats work exponentially.
DP: stores prior results and reduces time dramatically.

def fib(n: int) -> int:
    if n <= 1:
        return n

    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

Coin Change

Pattern: dynamic programming with minimum subproblem cost.

def coin_change(coins: list[int], amount: int) -> int:
    dp = [amount + 1] * (amount + 1)
    dp[0] = 0

    for current in range(1, amount + 1):
        for coin in coins:
            if coin <= current:
                dp[current] = min(dp[current], dp[current - coin] + 1)

    return dp[amount] if dp[amount] != amount + 1 else -1

Binary Search

def binary_search(nums: list[int], target: int) -> int:
    left, right = 0, len(nums) - 1

    while left <= right:
        mid = left + (right - left) // 2
        if nums[mid] == target:
            return mid
        if nums[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1

What to say: binary search requires a monotonic condition, usually a sorted array or a search space where one side can be eliminated.

Coding Prompt Pattern Map

Many of the listed workbook problems reduce to these patterns:

Pattern Common Problems
Hash map Two Sum, Valid Anagram, Character Frequency
Two pointers Palindrome, Three Sum, Merge Sorted Arrays
Sliding window Longest Substring, Minimum Window, Max Window
Prefix sum Range sum, subarray totals, difference arrays
Stack Valid Parentheses, Daily Temperatures, Histogram
Linked-list pointers Reverse list, middle node, cycle detection
BFS/DFS Number of Islands, tree traversal, connected components
Heap Top K Frequent, Kth Largest, task scheduling
Dynamic programming Coin Change, Climbing Stairs, LCS, Edit Distance
Binary search Search insert, rotated array, first/last position

Strong interview line:
I want to classify the problem first, because that usually determines the right data structure and complexity target.

Coding Quick Answer Bank

Use these one-line expansions for the rest of the coding workbook:

Problem Best Pattern What To Say
Three Sum sort + two pointers Fix one value, solve Two Sum on the suffix, skip duplicates
Merge Sorted Arrays two pointers Compare heads and write smaller item first
Remove Duplicates From Sorted Array slow/fast pointers Keep a write pointer for the next unique value
Rotate Array reverse sections Reverse whole array, then reverse both parts
Group Anagrams hash map on sorted key Sorted characters or frequency tuple identifies each group
Majority Element Boyer-Moore Majority candidate survives pair cancellation
Merge Intervals sort + sweep Sort by start time, then merge overlaps greedily
Meeting Rooms Overlap sort + compare neighbors If current start is before previous end, there is overlap
Min Stack stack of pairs Store value plus running minimum
LRU Cache hash map + doubly linked list Need O(1) lookup and O(1) eviction
Daily Temperatures monotonic stack Keep decreasing temperatures until a warmer day appears
Trapping Rain Water two pointers or stack Water depends on left max and right max boundaries
Validate BST recursion with bounds Every node must stay inside inherited min/max limits
Lowest Common Ancestor recursive split First node where targets diverge is the answer
Clone Graph DFS/BFS + map Need visited map from original node to cloned node
Top K Frequent heap or bucket sort Count first, then extract highest frequencies
Longest Common Subsequence 2D DP dp[i][j] is the best answer for both prefixes
Edit Distance 2D DP Transition is insert, delete, or replace
Combination Sum backtracking Choose current number again or move forward
Kth Largest Element heap or quickselect Keep top k efficiently instead of fully sorting

Data Modeling

Core Interview Formula

For a modeling question, answer in this order:

  1. business goal
  2. grain
  3. facts
  4. dimensions
  5. keys
  6. history strategy
  7. quality checks
  8. performance strategy

1. Fact vs Dimension

Fact table: stores measurable events or outcomes.
Dimension table: stores descriptive context used to slice facts.

Example:

  - fact_sales stores amount and quantity measures
  - dim_customer, dim_product, and dim_date provide the slicing context

Interview line: facts answer “what happened” and dimensions answer “in what context.”

2. Grain

Short answer: grain is the exact meaning of one row.

Examples:

  - one row per order line
  - one row per customer per day
  - one row per sensor reading

If you do not define grain first, the rest of the model becomes ambiguous and metrics become unreliable.

3. Star Schema vs Snowflake Schema

Star schema: central fact table with denormalized dimensions.
Snowflake schema: dimensions are normalized into additional related tables.

When to prefer star:

  - BI-facing analytics where query simplicity and speed matter
  - teams that value fewer joins over minimal redundancy

When snowflake can help:

  - very large dimensions where normalization saves real storage
  - shared reference data that must be maintained in one place

4. Visual: Star Schema

(Diagram: star schema example, with a central fact table surrounded by dimension tables.)

5. Surrogate Key vs Natural Key

Natural key: business-generated identifier such as customer_id from a source system.
Surrogate key: warehouse-generated identifier used for modeling and history tracking.

Use surrogate keys when:

  - you need SCD Type 2 history, so one business key maps to many version rows
  - source system keys can change, collide, or be reused
  - you integrate the same entity from multiple sources

6. SCD Type 1, 2, and 3

Type 1

Overwrite old value with new value. No history preserved.

Use when: correcting errors or storing current state only.

Type 2

Insert a new version row for each meaningful change and track validity dates.

Typical columns:

  - surrogate key
  - effective_from and effective_to dates
  - is_current flag

Use when: historical reporting matters.
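The Type 2 write path can be sketched as a small Python function; the row shape (`valid_from`, `valid_to`, `is_current`) follows the typical columns above, and dates are kept as plain strings for simplicity:

```python
def apply_scd2(history, key, new_attrs, change_date):
    """Close the current version for `key` and append a new one (SCD Type 2).
    Rows are dicts: key, attrs, valid_from, valid_to (None while current), is_current."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return history              # no real change: do not create a version
            row["valid_to"] = change_date   # close the old version
            row["is_current"] = False
    history.append({"key": key, "attrs": new_attrs,
                    "valid_from": change_date, "valid_to": None, "is_current": True})
    return history
```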

Type 3

Store current and previous value in the same row.

Use when: only limited history is needed, such as previous plan and current plan.

7. Degenerate, Junk, and Conformed Dimensions

Degenerate dimension

A business identifier stored in the fact table with no separate dimension, such as order_number.

Junk dimension

A dimension that groups low-cardinality flags together, such as is_mobile, is_promo, is_new_user.

Conformed dimension

A shared dimension used consistently across multiple fact tables, such as a common dim_date or dim_customer.

8. Snapshot Models

Periodic snapshot

One row per entity per fixed interval, such as account balance per day.

Accumulating snapshot

One row per workflow item that gets updated as milestones are reached, such as order lifecycle from placed to shipped to delivered.

Event table

One row per event. Best when you want the raw history and can derive later snapshots.

9. Factless Fact Table

Used to record that something happened even when there is no numeric measure.

Example: student attendance, store visit, policy violation, eligibility relationship.

10. Bridge Table

Use a bridge table for many-to-many relationships.

Examples:

  - patients and insurance plans
  - students and courses
  - accounts and account holders

11. Late-Arriving Data

If late-arriving facts reference dimensions that are not loaded yet, common strategies are:

  - insert the fact against a default unknown member row and correct it later
  - create a placeholder dimension row from the key and enrich it when the full record arrives
  - hold the fact in a retry queue until the dimension loads

Interview point: explain whether correctness or freshness matters more.

12. Partitioning and Clustering

Partitioning improves scan efficiency by physically grouping data on a coarse key such as date.
Clustering or sorting improves pruning and locality inside partitions.

Bad partition example: partitioning by a high-cardinality user ID.
Better partition example: partition by event date, cluster by customer or region if it helps common filters.

13. User Growth Model Example

Business goal: track acquisition, activation, retention, and engagement.

Recommended model:

  - fact_user_event at one row per user per event
  - fact_user_day at one row per active user per day
  - dim_user with SCD Type 2 for plan and status changes
  - a conformed dim_date

Derived metrics:

  - daily and monthly active users
  - day-1, day-7, and day-30 retention cohorts
  - activation rate from signup to first key action

Quality checks:

  - no duplicate user-day rows
  - event counts reconcile against the raw event stream
  - no facts pointing at missing user dimension rows

14. Design Trade-Offs You Should Mention

Design Choice Benefit Trade-Off
Star schema simple analytics more dimension redundancy
Snowflake normalized dimensions more joins, harder BI use
SCD2 historical accuracy more storage and query complexity
Event model full detail downstream aggregation required
Snapshot model easy reporting storage growth and update logic
Wide table fewer joins schema rigidity and sparse columns

15. How To Explain A Model In An Interview

Use this sentence flow:

  1. The business question is ___.
  2. The grain is ___.
  3. The main fact tables are ___.
  4. The key dimensions are ___.
  5. History is handled with ___.
  6. Data quality is enforced through ___.
  7. Performance is improved with ___.
  8. The main trade-off is ___.

Modeling Quick Answer Bank

Topic Strong Short Answer
Normalization reduce redundancy and improve consistency in transactional systems
Denormalization duplicate selected attributes to simplify analytics and reduce joins
3NF each attribute depends on the key, the whole key, and nothing but the key
OLTP vs OLAP OLTP optimizes writes and transactions; OLAP optimizes analytical reads
JSON column trade-offs flexible ingestion, but weaker governance and harder analytics if overused
Wide vs narrow table wide is convenient for reads; narrow is more extensible and storage efficient
Aggregated table precompute expensive metrics for faster dashboards and common BI queries
Data versioning preserve historical states so changes are explainable and reproducible
GDPR deletion support subject lookup, delete propagation, and downstream purge or tombstone handling
Unknown member row use a default dimension row so facts are not dropped when reference data is missing
Historical restatement recalculate prior periods only when business rules require corrected history
Semantic layer centralize metric definitions so teams do not compute the same KPI differently

Data Architecture

Core Interview Formula

For system or platform design questions, cover:

  1. requirements
  2. scale
  3. latency
  4. ingestion
  5. storage
  6. processing
  7. serving
  8. governance
  9. observability
  10. failure handling
  11. trade-offs

1. Batch vs Streaming

Batch: process data in chunks on a schedule.
Streaming: process data continuously with low latency.

When to use batch:

  - scheduled reporting, backfills, and heavy transformations
  - latency requirements measured in hours, not seconds

When to use streaming:

  - fraud detection, alerting, and operational dashboards
  - latency requirements measured in seconds or minutes

Strong answer: I choose based on business latency requirements, not because one is universally better.

2. Kafka Basics

Kafka is a distributed log used for durable, ordered event streams.

Key terms:

  - topic: a named event stream
  - partition: an ordered shard of a topic
  - offset: a record's position inside a partition
  - consumer group: consumers that split partitions for parallel processing

Good interview example: clickstream events, CDC events, payment state changes.

3. Delivery Guarantees

At-most-once

Messages may be lost, but never processed more than once.

At-least-once

Messages are retried if needed, so duplicates are possible.

Exactly-once

Usually means exactly-once effect in the end-to-end pipeline, not magic delivery with no coordination. It requires idempotency, transactional guarantees, or deduplication.

4. Backpressure and Dead-Letter Queues

Backpressure: downstream system cannot keep up with upstream input rate.
DLQ: isolated storage for records that repeatedly fail processing.

What to say: DLQs are useful, but only if someone monitors and reprocesses them. A dead-letter queue without operations is just hidden data loss.

5. Idempotency

An idempotent pipeline can safely reprocess the same event without changing the final outcome incorrectly.

Examples:

  - MERGE on a business key instead of blind INSERT
  - deterministic output paths so reruns overwrite instead of duplicating
  - deduplication on event IDs at the sink
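Idempotency is easiest to demonstrate with a keyed upsert: applying the same batch twice leaves the store unchanged. The helper and event shape here are illustrative:

```python
def idempotent_apply(store: dict, events: list[dict]) -> dict:
    """Apply events keyed by event_id; replaying a batch changes nothing."""
    for ev in events:
        store[ev["event_id"]] = ev["payload"]  # upsert: last write wins per id
    return store
```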

6. CDC Pipeline Design

Typical design:

  1. source database emits change logs
  2. capture layer reads inserts, updates, deletes
  3. events land in Kafka or object storage
  4. processing layer standardizes schema and orders changes
  5. sink applies merge or append logic to target tables
  6. validation checks counts, lag, duplicates, and delete handling
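Steps 4 and 5 above can be sketched as an ordered replay into keyed state; the operation codes 'I'/'U'/'D' are an assumed convention, not a specific tool's format:

```python
def apply_cdc(table: dict, changes: list[dict]) -> dict:
    """Replay ordered CDC events (op: 'I' insert, 'U' update, 'D' delete)
    into a keyed table state. Ordering per key is assumed correct."""
    for change in changes:
        key, op = change["key"], change["op"]
        if op in ("I", "U"):
            table[key] = change["row"]
        elif op == "D":
            table.pop(key, None)  # tolerate a delete for an unseen key
    return table
```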

Failure modes to mention:

  - out-of-order changes applied in the wrong sequence
  - missed deletes leaving ghost rows in the target
  - source schema changes breaking the capture layer
  - replication lag making the target silently stale

7. API Ingestion Design

Typical approach:

  - paginate through the API with incremental cursors or updated-since filters
  - respect rate limits with backoff and retries
  - land raw responses first, then parse and model downstream
  - track watermarks so reruns do not re-pull everything

Trade-offs: APIs are often slower, rate-limited, and less reliable than CDC or file-based ingestion.

8. Spark, Trino, and Lakehouse Roles

Spark

Best for large-scale transformation, batch processing, streaming jobs, ML preparation, and custom compute logic.

Trino

Best for interactive SQL, federated access, and fast analytical reads across many systems.

Lakehouse

Combines warehouse-style table management with data lake storage flexibility.

Strong comparison: Spark is often the compute engine for heavy transformation. Trino is often the query engine for interactive analytics and federation.

9. Iceberg and Delta Basics

Both are table formats that add transactional metadata to data lake storage.

Common benefits:

  - ACID transactions on object storage
  - schema evolution without rewriting data
  - time travel and snapshot-based rollback
  - better file and partition management

What to say if asked for differences: both solve similar problems; exact trade-offs depend on engine ecosystem, operational standards, and feature maturity in your platform.

10. Lakehouse Concept

A lakehouse stores data in open object storage but manages it like a queryable table system with metadata, transactions, and governance.

That allows multiple engines to share data while avoiding some classic data-lake chaos.

11. Visual: Platform Flow

(Diagram: data platform pipeline example, flowing from sources through ingestion, storage, processing, and serving.)

12. Governance

Be ready to speak about:

  - access control: roles, row-level security, column masking
  - lineage and auditability
  - PII handling, retention, and deletion
  - data contracts and ownership

Strong interview line: governance is not a separate compliance layer added at the end. It has to be built into ingestion, storage, query serving, and auditing.

13. Observability

For data platforms, observability should cover:

  - pipeline health: failures, retries, runtimes
  - freshness and end-to-end lag
  - data quality: volumes, nulls, duplicates, schema drift
  - cost and resource usage

Good answer: I want both platform health and data quality visibility, because pipelines can be technically healthy while still producing wrong data.

14. Reliability and Fault Tolerance

Key ideas to mention:

  - retries with backoff plus idempotent writes
  - replayability from durable logs or raw storage
  - checkpointing for streaming state
  - graceful degradation and clear failure ownership

15. Architecture Trade-Offs

Choice Benefit Cost
Batch simple and cheaper higher latency
Streaming low latency more operational complexity
Federation fresh access to source systems source dependency and governance complexity
Central ingestion predictable curated datasets latency and duplication
Open table formats interoperability operational maturity required
Multi-tenant platform cost efficiency stronger isolation and governance required

16. Example: How To Answer A Real-Time Fraud Platform Question

Requirements: detect suspicious transactions within seconds, support analyst review, maintain auditability.
Ingestion: transaction events stream through Kafka.
Processing: stream processor enriches events with customer and device context.
Storage: hot alerts in operational store, full event history in lakehouse.
Serving: alert API plus analyst dashboard.
Governance: PII masking, role-based access, lineage for decisions.
Observability: end-to-end lag, false positive rate, dropped events, alert throughput.
Failure handling: replay from Kafka, idempotent alert writes, DLQ for malformed events.
Trade-off: lower latency increases complexity, so I would reserve the streaming path for decisions that truly need it.

17. How To Present Architecture In An Interview

Talk from left to right:

  1. sources
  2. ingestion
  3. storage
  4. processing
  5. serving
  6. governance and monitoring across the whole flow

That keeps the answer structured and easier to follow.

Architecture Quick Answer Bank

Topic Strong Short Answer
Topic, partition, offset topic is the stream, partition is an ordered shard, offset is record position
Producer vs consumer producer writes events, consumer reads and processes them
Consumer group lets multiple consumers split partitions for parallel processing
Lambda vs Kappa Lambda separates batch and speed layers; Kappa keeps a single streaming path
Data lake vs warehouse lake is storage-first and flexible; warehouse is modeled and query-optimized
Metadata layer stores technical and business context about datasets, ownership, lineage, and policies
Data contract formal agreement on schema, semantics, quality, and change expectations
RBAC vs ABAC RBAC uses roles; ABAC evaluates attributes such as team, region, sensitivity, and purpose
Encryption at rest vs in transit protect stored data on disk and protect network traffic between systems
SLA vs SLO SLA is the external commitment; SLO is the internal reliability target
Materialized view precomputed query result for faster reads, with refresh trade-offs
Small files problem too many tiny files increase metadata and planning overhead in lakehouse systems
Partition evolution change partition strategy over time without rewriting all historical data when format allows it
Federation vs ingestion federation favors freshness and lower duplication; ingestion favors curated performance and control
Cost vs latency lower latency usually requires more always-on infrastructure and operational complexity

Interview Delivery Tips

A Good SQL Answer Sounds Like This

I will first define the output grain. Then I will choose the correct join and aggregation strategy. After that I will validate row counts and explain any performance considerations such as partition pruning or join explosion.

A Good Coding Answer Sounds Like This

I will start with the brute-force approach so we have a correct baseline. Then I will optimize it using the right data structure, state the complexity, and test the edge cases before finalizing the code.

A Good Modeling Answer Sounds Like This

I will start from the business question and define the grain first. Then I will identify facts, dimensions, history strategy, and the quality checks needed to keep the model trustworthy.

A Good Architecture Answer Sounds Like This

I will clarify requirements first, especially scale and latency. Then I will walk through ingestion, storage, processing, serving, governance, and failure handling, and I will close with the main trade-offs.

Final Advice

Practice each answer out loud using the five-step structure at the top of this guide: definition, why it matters, example, trade-off, implementation. Interviewers remember structure and judgment far more than memorized syntax.