Data Engineering Interview Crash Workbook

3-Day Intensive Prep for Salesforce, Google, Snowflake, Databricks

Table of Contents

  1. How To Use This Workbook
  2. 3-Day Study Plan
  3. Master Question Bank With Answer Patterns
  4. Section 1: SQL
  5. Section 2: Coding
  6. Section 3: Data Modeling
  7. Section 4: Data Architecture
  8. Rapid Hands-On Exercise Plan
  9. Final 3-Day Prioritization
  10. Interview Speaking Rules
  11. Personal Focus For You

How To Use This Workbook

You have 3 days, so do not try to memorize 500 disconnected answers. Use this workbook in this order for every topic:

  1. Understand the idea
  2. Say the short answer out loud
  3. Expand into the long answer
  4. Do the exercise by hand
  5. Code or query it yourself
  6. Explain the trade-off

For each topic below, follow the same template: short answer, long answer, exercise, understand it.


3-Day Study Plan

Day 1 β€” SQL + Data Modeling

Morning

Afternoon

Evening

Day 2 β€” Coding + Python Basics

Morning

Afternoon

Evening

Day 3 β€” Architecture + Behavioral + Mock Interviews

Morning

Afternoon

Evening


Master Question Bank With Answer Patterns

The 500 questions are grouped into patterns. Master the pattern once, then you can answer 10–20 related questions.


Section 1 β€” SQL (1–150)

A. Window Functions (1–25)

  1. What is a window function?

Short answer: A window function calculates a value across related rows without collapsing the result set.

Long answer: Unlike GROUP BY, which reduces rows, a window function keeps each row and adds analytical context such as rank, running total, moving average, or previous/next row comparison. It usually uses OVER(), optionally with PARTITION BY and ORDER BY.

Exercise: Take a sales table with columns (employee_id, sale_date, amount). Write a query that shows the running total by employee.

SELECT
  employee_id,
  sale_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY employee_id
    ORDER BY sale_date
  ) AS running_total
FROM sales;

Understand it: PARTITION BY restarts the total for each employee, and ORDER BY makes the SUM cumulative in sale_date order.


  2. Difference between GROUP BY and window function

Short answer: GROUP BY collapses rows; window functions do not.

Long answer: Use GROUP BY when you need one row per group. Use window functions when you need row-level detail plus aggregated context.

Exercise: Compare these two queries on the same table and explain why one returns fewer rows.


  3. ROW_NUMBER vs RANK vs DENSE_RANK

Short answer: ROW_NUMBER is always unique; RANK leaves gaps after ties; DENSE_RANK does not.

Long answer: All three number rows within each partition in ORDER BY sequence. ROW_NUMBER assigns 1, 2, 3 even when values tie; RANK gives tied rows the same rank and skips the following numbers; DENSE_RANK gives tied rows the same rank without skipping.

Exercise: For salary values 100, 100, 90, write all three rankings and explain the output.
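You can verify the exercise at the keyboard. A minimal sketch using Python's sqlite3 (in-memory database, invented 3-row salary table; assumes an SQLite build of 3.25+ with window-function support):

```python
import sqlite3

# Invented table matching the exercise values 100, 100, 90.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("a", 100), ("b", 100), ("c", 90)])

rows = conn.execute("""
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM emp
    ORDER BY row_num
""").fetchall()

for r in rows:
    print(r)
# (100, 1, 1, 1)
# (100, 2, 1, 1)
# (90, 3, 3, 2)
```

The tied 100s get distinct ROW_NUMBERs, the same RANK (with 3 skipped... no, with 2 skipped, so the next rank is 3), and the same DENSE_RANK followed by 2.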


  4. Running total

Short answer: Use SUM(…) OVER (ORDER BY …).

Exercise:

SELECT order_date,
       revenue,
       SUM(revenue) OVER (ORDER BY order_date) AS running_revenue
FROM daily_revenue;

  5. Moving average

Short answer: Use a frame clause such as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.

Exercise:

SELECT order_date,
       AVG(revenue) OVER (
         ORDER BY order_date
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg_3
FROM daily_revenue;

Understand it: This is a 3-row moving average, not always 3 calendar days.


  6. LEAD vs LAG

Short answer: LAG looks backward; LEAD looks forward.

Exercise: Find time between two consecutive user events.

SELECT
  user_id,
  event_time,
  event_time - LAG(event_time) OVER (
    PARTITION BY user_id ORDER BY event_time
  ) AS gap_from_previous
FROM events;

  7. FIRST_VALUE and LAST_VALUE

Short answer: Return first or last value in the window.

Important trap: the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE returns the current row's value, not the partition's last. Add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to get the true last value.
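The trap is easy to demonstrate. A sketch with sqlite3 and three invented rows (SQLite 3.25+ for window functions):

```python
import sqlite3

# Three ordered rows; the true last value overall is 30.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

rows = conn.execute("""
    SELECT id,
           LAST_VALUE(v) OVER (ORDER BY id) AS default_frame,
           LAST_VALUE(v) OVER (
             ORDER BY id
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame
    FROM t
    ORDER BY id
""").fetchall()

# The default frame stops at the current row, so default_frame just echoes v:
# (1, 10, 30), (2, 20, 30), (3, 30, 30)
```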


  8. NTILE

Short answer: Splits ordered rows into buckets.

Use case: quartiles, deciles, customer segmentation.


  9. PERCENT_RANK

Short answer: Gives relative rank from 0 to 1.


  10. CUME_DIST

Short answer: Cumulative distribution showing fraction of rows less than or equal to current row.


  11. What is PARTITION BY?

Short answer: It defines groups for the window function.


  12. What is ORDER BY in a window?

Short answer: It defines processing sequence inside each partition.


  13. What is a frame clause?

Short answer: It narrows which rows in the ordered partition are included.


  14. Deduplicate records using ROW_NUMBER

Short answer: Assign row numbers per business key, keep rn = 1.

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM customer_records
)
SELECT *
FROM ranked
WHERE rn = 1;

Understand it: This keeps the latest record per customer.


  15. Top N per group

Short answer: Use ROW_NUMBER() or DENSE_RANK() with PARTITION BY.


  16. Detect gaps in dates or IDs

Short answer: Compare current row with previous row using LAG.
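As a concrete sketch, here is the LAG pattern on an invented ID sequence, run through sqlite3 (the table and values are made up for the drill):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ids (id INTEGER)")
conn.executemany("INSERT INTO ids VALUES (?)", [(1,), (2,), (5,), (6,), (9,)])

gaps = conn.execute("""
    WITH with_prev AS (
        SELECT id, LAG(id) OVER (ORDER BY id) AS prev_id
        FROM ids
    )
    SELECT prev_id, id
    FROM with_prev
    WHERE id - prev_id > 1
    ORDER BY id
""").fetchall()

# [(2, 5), (6, 9)]: the sequence jumps from 2 to 5 and from 6 to 9
```

Note the first row's prev_id is NULL, so `id - prev_id > 1` evaluates to NULL and the row is filtered out, which is the behavior you want here.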


  17. Sessionization

Short answer: Split user activity into sessions based on inactivity threshold.

Long answer: Use LAG to compare current event time with previous event time. If the gap exceeds a threshold like 30 minutes, mark it as a new session. Then use cumulative sum to generate session IDs.

WITH tagged AS (
  SELECT *,
         CASE
           WHEN event_time - LAG(event_time) OVER (
             PARTITION BY user_id ORDER BY event_time
           ) > INTERVAL '30' MINUTE
           OR LAG(event_time) OVER (
             PARTITION BY user_id ORDER BY event_time
           ) IS NULL
           THEN 1 ELSE 0
         END AS new_session
  FROM events
), sessions AS (
  SELECT *,
         SUM(new_session) OVER (
           PARTITION BY user_id ORDER BY event_time
         ) AS session_id
  FROM tagged
)
SELECT * FROM sessions;

Exercise: Change threshold from 30 minutes to 15 minutes and explain what changes.


18–25. Practice prompts

  18. Rank employees within department by salary
  19. Find first purchase date per customer
  20. Find repeat purchase gap
  21. Rolling 7-day revenue
  22. Top 3 products per category
  23. Detect change in subscription plan
  24. Compare current and previous order amount
  25. Bucket customers into revenue quartiles

For each one, answer with:


B. Joins (26–50)

Sample tables used in the examples below

customers

  customer_id  customer_name  region
  1            Alice          EU
  2            Bob            US
  3            Carol          EU
  4            Dan            APAC

orders

  order_id  customer_id  region  amount
  101       1            EU      120
  102       1            EU      80
  103       2            US      75
  104       5            US      60

employees

  employee_id  employee_name  manager_id
  10           Eva            null
  11           Frank          10
  12           Grace          10

  26. What is an INNER JOIN?

Short answer: Returns only matching rows from both tables.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.amount
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id;

Result:

  customer_id  customer_name  order_id  amount
  1            Alice          101       120
  1            Alice          102       80
  2            Bob            103       75

  27. What is a LEFT JOIN?

Short answer: Returns all rows from the left table and matching rows from the right.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;

Result:

  customer_id  customer_name  order_id
  1            Alice          101
  1            Alice          102
  2            Bob            103
  3            Carol          null
  4            Dan            null

  28. Difference between LEFT JOIN and INNER JOIN

Short answer: LEFT keeps unmatched left-side rows; INNER removes them.

  29. RIGHT JOIN

Short answer: Same logic as LEFT JOIN but preserves right table rows.

  30. FULL OUTER JOIN

Short answer: Returns all matching and non-matching rows from both tables.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.amount
FROM customers c
FULL OUTER JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;

Result:

  customer_id  customer_name  order_id  amount
  1            Alice          101       120
  1            Alice          102       80
  2            Bob            103       75
  3            Carol          null      null
  4            Dan            null      null
  null         null           104       60

Understand it: this keeps unmatched rows from both sides.

  31. CROSS JOIN

Short answer: Cartesian product of both tables.

Sample tables for cross join

colors

color
Red
Blue

sizes

size
S
M
L

Example:

SELECT
  c.color,
  s.size
FROM colors c
CROSS JOIN sizes s
ORDER BY c.color, s.size;

Result:

color size
Blue L
Blue M
Blue S
Red L
Red M
Red S

Understand it: every row in colors pairs with every row in sizes, so 2 x 3 = 6 rows.

  32. SELF JOIN

Short answer: A table joined to itself, often for hierarchy or comparisons.

Example:

SELECT
  e.employee_name,
  m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.employee_id;

Result:

  employee_name  manager_name
  Eva            null
  Frank          Eva
  Grace          Eva

Understand it: the same employees table plays two roles here, once as the employee (alias e) and once as the manager (alias m).

  33. Anti join

Short answer: Find rows in A not in B.

SELECT c.*
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

Result: Carol and Dan

  34. Semi join

Short answer: Return rows in A that have a match in B, without bringing B columns.

SELECT
  c.customer_id,
  c.customer_name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE c.customer_id = o.customer_id
);

Result:

  customer_id  customer_name
  1            Alice
  2            Bob

Understand it: unlike a regular join, this only answers β€œdoes a match exist?” and does not return columns from orders.

Anti join vs LEFT JOIN

Short answer: LEFT JOIN keeps all rows from the left table; ANTI JOIN keeps only left rows that do not match.

Example LEFT JOIN:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;

LEFT JOIN result:

  customer_id  customer_name  order_id
  1            Alice          101
  1            Alice          102
  2            Bob            103
  3            Carol          null
  4            Dan            null

Example ANTI JOIN pattern:

SELECT
  c.customer_id,
  c.customer_name
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

ANTI JOIN result:

  customer_id  customer_name
  3            Carol
  4            Dan

Understand it: anti join is usually written using LEFT JOIN ... WHERE right_key IS NULL or NOT EXISTS (...).

  35. What causes duplicate rows after a join?

Short answer: Non-unique join keys.

Understand it: If one customer has 3 orders and 2 payments, a direct join may create 6 rows.
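You can see the explosion directly. A sketch with sqlite3 and invented orders/payments rows (3 orders, 2 payments, one customer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (customer_id INTEGER, order_id INTEGER);
    CREATE TABLE payments (customer_id INTEGER, payment_id INTEGER);
    INSERT INTO orders   VALUES (1, 101), (1, 102), (1, 103);
    INSERT INTO payments VALUES (1, 201), (1, 202);
""")

n = conn.execute("""
    SELECT COUNT(*)
    FROM orders o
    JOIN payments p ON o.customer_id = p.customer_id
""").fetchone()[0]

# n == 6: each of the 3 orders pairs with each of the 2 payments
```

Any SUM over amount columns after this join would be inflated, which is the trap behind question 64 below on aggregation after join.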

  36. How do you fix join duplication?

Short answer: Understand cardinality, pre-aggregate, or deduplicate before joining.

  37. What is join skew?

Short answer: Uneven key distribution causes one node or partition to process far more data.

  38. Broadcast join vs shuffle join

Short answer: Broadcast sends small table to all workers; shuffle redistributes both sides by key.

  39. When should you use a broadcast join?

Short answer: When one side is small enough to replicate efficiently.

  40. Joining on multiple keys

Exercise: Write a join on (customer_id, region) and explain why using only one column is dangerous.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id
FROM customers c
JOIN orders o
  ON c.customer_id = o.customer_id
 AND c.region = o.region;

Why this matters: if customer_id is reused across regions or the business key is composite, joining on only customer_id can create incorrect matches.

  41. Non-equi join

Short answer: Join condition uses <, >, BETWEEN, or other non-equality conditions.

Mini example: join orders to a pricing-band table using amount BETWEEN min_amount AND max_amount.

  42. Join order and performance

Short answer: Filter early, shrink large tables before joining.

  43. Null behavior in joins

Short answer: NULL = NULL is not true, so null keys do not match in standard equality joins.

  44. How do you debug a bad join?

Checklist:

Quick debug query:

SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;

If this returns rows, you already know that joining customers to orders can create multiple rows per customer unless that is the intended output grain.

45–50. Practice prompts

  45. Find customers with no orders
  46. Find orders with missing product reference
  47. Join employee with manager using self join
  48. Explain many-to-many join explosion
  49. Compare EXISTS vs IN
  50. Reduce join cost on large datasets

C. Aggregations (51–75)

  51. COUNT(*) vs COUNT(column)

Short answer: COUNT(*) counts all rows; COUNT(column) ignores nulls.

  52. What does GROUP BY do?

Short answer: Groups rows by one or more columns before aggregation.

  53. WHERE vs HAVING

Short answer: WHERE filters rows before grouping; HAVING filters groups after aggregation.
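A minimal sketch of the difference, using sqlite3 and an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('EU', 100), ('EU', 50), ('US', 30), ('US', 10);
""")

rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 20          -- row filter, applied before grouping
    GROUP BY region
    HAVING SUM(amount) > 100   -- group filter, applied after aggregation
""").fetchall()

# [('EU', 150)]: WHERE drops the US 10 row first, then HAVING drops
# the US group because its remaining total is only 30
```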

  54. Conditional aggregation

Short answer: Use SUM(CASE WHEN … THEN 1 ELSE 0 END) or similar.

SELECT
  department,
  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS active_count
FROM employees
GROUP BY department;

  55. DISTINCT count

Short answer: COUNT(DISTINCT user_id) counts unique users.

  1. Approx distinct count

Short answer: Used for large-scale systems when exact distinct count is expensive.

  57. ROLLUP

Short answer: Adds subtotal and grand total rows.

  58. CUBE

Short answer: Produces all combinations of subtotals.

  59. GROUPING SETS

Short answer: Lets you explicitly choose multiple grouping combinations.

  60. Median

Short answer: Some engines support percentile_cont(0.5).

  61. Percentile

Short answer: Used to understand distribution, not just average.

  62. Why is AVG sometimes misleading?

Short answer: Outliers can distort it; median may be better.

  63. Nulls in aggregations

Short answer: Most aggregates ignore nulls except COUNT(*).

  64. Aggregation after join trap

Short answer: If a join duplicates rows, your sums and counts become wrong.

65–75. Practice prompts

  65. Daily active users
  66. Revenue by region
  67. Monthly cumulative sales
  68. Ratio of active to total users
  69. 95th percentile latency
  70. Distinct customers by month
  71. Product revenue share
  72. Top category by revenue
  73. Countries with declining growth
  74. Average order value
  75. Fraud count by severity

D. CTEs and Subqueries (76–100)

  76. What is a CTE?

Short answer: A named temporary result set used inside a query.

  77. Why use CTEs?

Short answer: Readability, reusability, step-wise logic.

  78. Recursive CTE

Short answer: A CTE that references itself, used for hierarchies or graph-like traversal.
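A sketch of a recursive CTE walking the employees hierarchy from the joins section (Eva manages Frank and Grace), run through sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, employee_name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES (10, 'Eva', NULL), (11, 'Frank', 10), (12, 'Grace', 10);
""")

rows = conn.execute("""
    WITH RECURSIVE org AS (
        -- anchor: the root of the hierarchy (no manager)
        SELECT employee_id, employee_name, 0 AS depth
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- recursive step: attach direct reports, one level deeper
        SELECT e.employee_id, e.employee_name, org.depth + 1
        FROM employees e
        JOIN org ON e.manager_id = org.employee_id
    )
    SELECT employee_name, depth FROM org
    ORDER BY depth, employee_name
""").fetchall()

# [('Eva', 0), ('Frank', 1), ('Grace', 1)]
```

The anchor query seeds the result, then the recursive member runs until it adds no new rows.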

  79. Subquery vs CTE

Short answer: Similar logically; CTE is often easier to read and debug.

  80. Correlated subquery

Short answer: A subquery that depends on values from the outer query.

  81. EXISTS vs IN

Short answer: Both check membership; EXISTS is often clearer and can be better for correlated logic.

  82. Scalar subquery

Short answer: Returns one value, such as a max date.

  83. Common CTE interview pattern

Pattern:

  1. Filter base data
  2. Rank or aggregate
  3. Join results
  4. Final filter

84–100. Practice prompts

  84. Latest record per customer
  85. Org hierarchy using recursive CTE
  86. Orders above customer average
  87. Users with no transactions
  88. Products never sold
  89. Most recent login per user
  90. Monthly cohort extraction
  91. Nested CTE refactor
  92. CTE + window combination
  93. Recursive tree traversal
  94. Parent-child path building
  95. Subquery to CTE rewrite
  96. Use EXISTS for filtering
  97. Find max salary per department
  98. Find rows above department average
  99. Query decomposition strategy
  100. Explain readability vs performance trade-off

E. Advanced SQL (101–150)

For these, use the answer pattern below.

Short answer template: Define it in one sentence.
Long answer template: Explain use case, benefit, and one limitation.
Hands-on template: Write one query or pseudo-query.

  101. Deduplication strategies
  102. CDC query logic
  103. Snapshot vs incremental query
  104. Merge / upsert pattern
  105. Time travel query
  106. JSON parsing in SQL
  107. Arrays in SQL
  108. Struct / nested fields
  109. Regex filtering
  110. Data masking
  111. Row-level security
  112. Column masking
  113. Predicate pushdown
  114. Partition pruning
  115. File skipping
  116. Explain plan basics
  117. Cost-based optimization
  118. Materialized views
  119. Query caching
  120. Resource groups
  121. Query federation
  122. Trino pushdown behavior
  123. Snowflake clustering basics
  124. BigQuery partitioning basics
  125. Iceberg metadata tables
  126. Delta transaction log idea
  127. Null-safe comparisons
  128. Merge late-arriving data
  129. Validate counts after load
  130. Build audit queries
  131. Find broken foreign keys
  132. Data quality assertions
  133. Window performance concerns
  134. Join performance concerns
  135. Aggregation skew
  136. Snapshot isolation concept
  137. Query retry strategy
  138. Fault-tolerant SQL pipelines
  139. Access filters
  140. Secure data views
  141. Multi-tenant filtering
  142. Dynamic SQL risks
  143. SQL injection basics
  144. Query observability
  145. Debug wrong result sets
  146. Debug row explosion
  147. Debug null issues
  148. Query anti-patterns
  149. SQL interview best practices
  150. How to talk through a SQL solution live

Section 2 β€” Coding (151–300)

Use this answer format for all coding problems:

A. Arrays and Strings (151–180)

  151. Reverse a string

Short answer: Use slicing in Python or two pointers.

def reverse_string(s: str) -> str:
    return s[::-1]

  152. Palindrome check

def is_palindrome(s: str) -> bool:
    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]

  153. Two Sum

Short answer: Use a hash map for O(n).

def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        need = target - n
        if need in seen:
            return [seen[need], i]
        seen[n] = i
    return []

Understand it: store past values, not future values.

  154. Three Sum

Short answer: Sort + two pointers.
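One possible implementation of the sort + two pointers approach (for the classic "triplets summing to zero" variant):

```python
def three_sum(nums):
    """Return unique triplets that sum to zero. O(n^2) after sorting."""
    nums = sorted(nums)
    result = []
    for i in range(len(nums) - 2):
        if i > 0 and nums[i] == nums[i - 1]:
            continue  # skip duplicate anchor values
        lo, hi = i + 1, len(nums) - 1
        while lo < hi:
            s = nums[i] + nums[lo] + nums[hi]
            if s < 0:
                lo += 1          # total too small: move left pointer right
            elif s > 0:
                hi -= 1          # total too large: move right pointer left
            else:
                result.append([nums[i], nums[lo], nums[hi]])
                while lo < hi and nums[lo] == nums[lo + 1]:
                    lo += 1      # skip duplicate middle values
                while lo < hi and nums[hi] == nums[hi - 1]:
                    hi -= 1      # skip duplicate end values
                lo += 1
                hi -= 1
    return result
```

Say out loud why sorting is what makes the two-pointer move valid: after sorting, a too-small sum can only be fixed by moving lo right.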

  155. Longest substring without repeating characters

Short answer: Sliding window + set/map.
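A sliding-window sketch using a map of last-seen indexes:

```python
def longest_unique_substring(s):
    """Length of the longest substring without repeating characters. O(n)."""
    last = {}    # char -> index of its most recent occurrence
    start = 0    # left edge of the current window
    best = 0
    for i, ch in enumerate(s):
        if ch in last and last[ch] >= start:
            start = last[ch] + 1  # repeat inside window: jump past it
        last[ch] = i
        best = max(best, i - start + 1)
    return best
```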

  156. Maximum subarray

Short answer: Kadane’s algorithm.
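A compact Kadane sketch, tracking the best sum ending at each index:

```python
def max_subarray(nums):
    """Maximum sum of a contiguous subarray (Kadane). O(n) time, O(1) space."""
    best = cur = nums[0]
    for n in nums[1:]:
        cur = max(n, cur + n)    # extend the run or restart at n
        best = max(best, cur)
    return best
```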

  157. Merge sorted arrays

  158. Remove duplicates from sorted array

  159. Rotate array

  160. Prefix sum

  161. Difference array idea

  162. Group anagrams

  163. Valid anagram

  164. String compression

  165. Character frequency map

  166. Substring search

  167. Pattern matching

  168. Sliding window max

  169. Longest common prefix

  170. Roman to integer

  171. Integer to Roman

  172. String to integer

  173. Reverse words in sentence

  174. Product of array except self

  175. Find missing number

  176. Best time to buy and sell stock

  177. Majority element

  178. Merge intervals

  179. Insert interval

  180. Meeting rooms overlap

For each one, do:

  1. write brute force
  2. optimize
  3. code it
  4. say time/space

B. Stack, Queue, Linked List (181–210)

  181. Valid parentheses

def is_valid(s: str) -> bool:
    # Map each closer to the opener it must match.
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in pairs.values():
            stack.append(ch)  # opener: push and wait for its closer
        else:
            # closer: must match the most recent unclosed opener
            if not stack or stack[-1] != pairs.get(ch):
                return False
            stack.pop()
    return not stack  # valid only if every opener was closed

  182. Min stack

  183. Queue using two stacks

  184. Stack using queues

  185. Reverse linked list

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def reverse_list(head):
    prev = None
    curr = head
    while curr:
        nxt = curr.next
        curr.next = prev
        prev = curr
        curr = nxt
    return prev

  186. Detect cycle in linked list

  187. Merge two sorted linked lists

  188. Find middle of linked list

  189. Remove nth node from end

  190. Reorder list

  191. LRU cache concept

  192. Circular queue

  193. Deque use cases

  194. Monotonic stack

  195. Next greater element

  196. Daily temperatures

  197. Largest rectangle in histogram

  198. Trapping rain water

  199. Evaluate reverse polish notation

  200. Sliding window with deque

201–210. Practice repeats and variations


C. Trees and Graphs (211–260)

  211. Binary tree traversal

Short answer: Preorder, inorder, and postorder are depth-first orders; level order is breadth-first.

  212. BFS vs DFS

Short answer: BFS uses queue, DFS uses stack/recursion.
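A minimal side-by-side sketch on an adjacency-list graph (dict of node to neighbor list, an assumed representation for the drill):

```python
from collections import deque

def bfs(graph, start):
    """Level-by-level traversal using a queue."""
    seen = {start}
    order = []
    q = deque([start])
    while q:
        node = q.popleft()
        order.append(node)
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                q.append(nb)
    return order

def dfs(graph, start):
    """Depth-first traversal using an explicit stack (iterative)."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # reversed so neighbors are visited in listed order
        stack.extend(reversed(graph.get(node, [])))
    return order
```

Swapping the queue for a stack is the entire difference; everything else is identical, which is a good line to say in the interview.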

  213. Maximum depth of binary tree

  214. Invert binary tree

  215. Validate BST

  216. Lowest common ancestor

  217. Level order traversal

  218. Tree diameter

  219. Path sum

  220. Serialize and deserialize tree

  221. Number of islands

  222. Clone graph

  223. Course schedule

  224. Topological sort

  225. Detect cycle in graph

  226. Connected components

  227. Shortest path in unweighted graph

  228. Dijkstra basics

  229. Union find basics

  230. Bipartite graph check

231–260. Practice variants

Hands-on rule: Draw the tree or graph before coding.


D. Recursion, Backtracking, DP, Search (261–300)

  261. Fibonacci recursion vs DP

  262. Climbing stairs

  263. Coin change

  264. Longest common subsequence

  265. Edit distance

  266. Knapsack idea

  267. Subsets

  268. Permutations

  269. Combination sum

  270. N-Queens idea

  271. Binary search

  272. Search in rotated sorted array

  273. Find first and last position

  274. Median of two sorted arrays idea

  275. Kth largest element using heap

  276. Top K frequent elements

  277. Heap basics

  278. Recursion tree thinking

  279. Memoization vs tabulation

  280. State definition in DP
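For the memoization vs tabulation prompt above, Fibonacci is the standard whiteboard contrast. A sketch of both directions:

```python
from functools import lru_cache

def fib_memo(n):
    """Top-down: plain recursion plus a cache (memoization)."""
    @lru_cache(maxsize=None)
    def f(k):
        return k if k < 2 else f(k - 1) + f(k - 2)
    return f(n)

def fib_tab(n):
    """Bottom-up: build from the base cases (tabulation), O(1) space."""
    if n < 2:
        return n
    prev, cur = 0, 1
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur
    return cur
```

Memoization keeps the natural recursive shape and only computes states you actually reach; tabulation avoids recursion depth limits and often lets you shrink the table, as here.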

281–300. Mixed practice

Interview speaking tip: Always say:


Section 3 β€” Data Modeling (301–400)

Core Answer Formula

For every modeling question answer in this order:

  1. business goal
  2. grain
  3. dimensions
  4. facts
  5. keys
  6. update pattern
  7. quality checks
  8. performance considerations

  301. Fact vs dimension

Short answer: Facts store measurable events; dimensions store descriptive context.

  302. Star schema

Short answer: Fact table in center with denormalized dimensions around it.

  303. Snowflake schema

Short answer: Dimensions are normalized into multiple related tables.

  304. When to use star vs snowflake

Short answer: Star for analytics simplicity and speed; snowflake for reduced redundancy and stricter modeling.

  305. What is grain?

Short answer: The exact level of detail represented by one row in a fact table.

Example: one row per order line, not one row per order.

  306. Surrogate key vs natural key

  307. SCD Type 1

  308. SCD Type 2

  309. SCD Type 3

  310. Factless fact table

  311. Degenerate dimension

  312. Junk dimension

  313. Conformed dimension

  314. Bridge table

  315. Role-playing dimension

  316. Snapshot table

  317. Accumulating snapshot

  318. Periodic snapshot

  319. Event table

  320. User growth model design

User growth model sample answer:

321–340. Practice prompts

  321. Design ecommerce warehouse
  322. Design payments data model
  323. Design fraud event model
  324. Design vulnerability scoring model
  325. Design customer 360 model
  326. Design support ticket analytics
  327. Design employee attrition analytics
  328. Design subscription analytics
  329. Design IoT telemetry model
  330. Design clickstream model
  331. Design ad impressions model
  332. Design experiment metrics model
  333. Design inventory analytics
  334. Design order fulfillment analytics
  335. Design API usage analytics
  336. Design model for late-arriving data
  337. Design historical pricing model
  338. Design SCD for customer address
  339. Design audit trail schema
  340. Design data quality monitoring model

341–360. Modeling concepts

  341. Normalization
  342. Denormalization
  343. 3NF
  344. OLTP vs OLAP
  345. Partition strategy
  346. Clustering strategy
  347. Data skew considerations
  348. Null handling strategy
  349. Schema evolution
  350. Data versioning
  351. Data retention
  352. Archival strategy
  353. GDPR deletion considerations
  354. Multi-tenant design
  355. Hierarchical data modeling
  356. Graph modeling basics
  357. Semi-structured data modeling
  358. JSON column trade-offs
  359. Wide vs narrow tables
  360. Aggregated tables

361–400. Advanced practice prompts

  361. Lakehouse table design
  362. Iceberg partition evolution idea
  363. Hudi vs Delta vs Iceberg discussion
  364. Streaming event model
  365. Batch snapshot model
  366. Slowly changing relationships
  367. Data vault basics
  368. Domain-driven modeling
  369. KPI modeling
  370. Metric consistency layer
  371. Metadata model
  372. Data catalog entity model
  373. Ownership model
  374. Access control model
  375. Lineage model
  376. Compliance model
  377. Data contract model
  378. Feature store model
  379. Time-series partitioning
  380. Daily cohort model
  381. Retention cohort model
  382. Marketing attribution model
  383. Session fact design
  384. Revenue recognition model
  385. Refund adjustment model
  386. Currency conversion model
  387. Multi-region reporting model
  388. Slowly changing facts
  389. Event sourcing model
  390. CQRS idea in analytics
  391. Cross-domain metric reconciliation
  392. Data mart integration
  393. Semantic layer concept
  394. Snapshot reconciliation
  395. Late-arriving dimensions
  396. Missing dimension handling
  397. Unknown member row
  398. Historical restatement handling
  399. Modeling trade-offs
  400. How to explain a data model in interview

Section 4 β€” Data Architecture (401–500)

Core Answer Formula

For architecture questions use this structure:

  1. requirements
  2. scale
  3. latency
  4. data sources
  5. ingestion
  6. storage
  7. processing
  8. serving
  9. governance
  10. observability
  11. failure handling
  12. trade-offs

  401. Batch vs streaming

Short answer: Batch processes data in chunks; streaming processes continuously with low latency.

Long answer: Choose batch when cost and simplicity matter more than immediacy. Choose streaming when decisions must be made in near real time, such as fraud or threat detection.

  402. Kafka basics

  403. Topic, partition, offset

  404. Producer vs consumer

  405. Consumer group

  406. Exactly-once concept

  407. At-least-once vs at-most-once

  408. Backpressure

  409. Dead-letter queue

  410. Idempotency

  411. CDC pipeline design

  412. API ingestion design

  413. Batch pipeline design

  414. Real-time alerting pipeline

  415. Lambda vs Kappa

  416. Spark architecture basics

  417. Trino architecture basics

  418. Iceberg basics

  419. Delta basics

  420. Lakehouse concept

421–440. Governance and scale

  421. Data lake design
  422. Data warehouse design
  423. Metadata layer
  424. OpenMetadata concept
  425. Data lineage
  426. Data catalog
  427. Data contracts
  428. RBAC
  429. ABAC
  430. OPA policy engine concept
  431. Row-level security
  432. Column-level security
  433. Encryption at rest and in transit
  434. Masking and tokenization
  435. Audit logging
  436. Monitoring and observability
  437. SLA vs SLO
  438. Reliability design
  439. Fault tolerance
  440. High availability

441–460. Platform design practice prompts

  441. Design a real-time fraud platform
  442. Design a vulnerability intelligence pipeline
  443. Design a clickstream analytics platform
  444. Design a customer 360 lakehouse
  445. Design a federated query layer
  446. Design a metadata-driven governance system
  447. Design a CDC ingestion system
  448. Design an API-to-lake ingestion system
  449. Design a cost-optimized analytics platform
  450. Design multi-tenant data platform
  451. Design low-latency dashboard backend
  452. Design event replay system
  453. Design schema evolution strategy
  454. Design data quality framework
  455. Design lineage and audit system
  456. Design cross-region data platform
  457. Design hybrid cloud data stack
  458. Design role-based access at query layer
  459. Design row filters for sensitive data
  460. Design platform for 1000+ users and 200k queries/day

461–480. Infra and operations

  461. Kubernetes basics for data platform
  462. Autoscaling
  463. Resource isolation
  464. Multi-cluster strategy
  465. Deployment strategies
  466. Blue-green deployment
  467. Canary deployment
  468. CI/CD for data platform
  469. Secrets management
  470. Disaster recovery
  471. Backup and restore
  472. Chaos testing idea
  473. Performance tuning
  474. Throughput vs latency
  475. Caching strategies
  476. Materialized views
  477. Query acceleration
  478. File compaction
  479. Small files problem
  480. Partition evolution

481–500. Executive-level trade-off questions

  481. Batch vs stream trade-offs
  482. Spark vs Trino trade-offs
  483. Iceberg vs Delta trade-offs
  484. Open table format benefits
  485. Federation vs ingestion trade-offs
  486. Build vs buy for metadata platform
  487. Lakehouse vs warehouse trade-offs
  488. Pushdown vs centralized compute
  489. Governance vs developer agility
  490. Cost vs latency
  491. Reliability vs speed of delivery
  492. Multi-tenant isolation trade-offs
  493. Real-time vs simplicity
  494. API integration challenges
  495. Schema drift handling
  496. Platform observability priorities
  497. Security-first architecture
  498. Migration from legacy Hadoop to modern stack
  499. How to present architecture in interview
  500. How to answer follow-up challenges confidently

Rapid Hands-On Exercise Plan

SQL Drills

  1. Create 5 small tables in any SQL environment.
  2. Practice 3 joins, 3 windows, 3 aggregations daily.
  3. For every query, explain output before running it.
  4. After running it, validate row count and logic.

Coding Drills

  1. Solve 10 core problems only, repeatedly.
  2. Write in Python from memory.
  3. For each problem say:
    • brute force
    • optimized
    • complexity
    • edge cases
  4. Recode without looking.

Modeling Drills

  1. Pick one business case.
  2. Define grain first.
  3. Draw fact and dimensions on paper.
  4. Explain late-arriving data and history.
  5. Explain quality checks.

Architecture Drills

  1. State requirements first.
  2. Draw ingestion, storage, compute, serving.
  3. Add governance and monitoring.
  4. Add failure scenarios.
  5. Add trade-offs.

Final 3-Day Prioritization

If you cannot finish all 500, do these first:

Must-Win SQL

Must-Win Coding

Must-Win Modeling

Must-Win Architecture


Interview Speaking Rules


Personal Focus For You

Because you said Java is stronger and Python is weaker:

Example interview line:

I usually think in Java-style data structures, but I’ll code this in simple Python for speed and readability.

That is completely acceptable.