Data Engineering Interview Crash Workbook

3-Day Intensive Prep for Salesforce, Google, Snowflake, Databricks

How To Use This Workbook
3-Day Study Plan
Master Question Bank With Answer Patterns
Section 1: SQL
Section 2: Coding
Section 3: Data Modeling
Section 4: Data Architecture
Rapid Hands-On Exercise Plan
Final 3-Day Prioritization
Interview Speaking Rules
Personal Focus For You

How To Use This Workbook

You have 3 days, so do not try to memorize 500 disconnected answers. Use this workbook in this order for every topic:

Understand the idea
Say the short answer out loud
Expand into the long answer
Do the exercise by hand
Code or query it yourself
Explain the trade-off

For each topic below, use this template:

What is it?
Why do we use it?
When should we use it?
What are the trade-offs?
How would I implement it?
What can go wrong?

3-Day Study Plan

Day 1 — SQL + Data Modeling

Morning

Window functions
GROUP BY / HAVING
Joins
CTEs
Deduplication

Afternoon

Fact vs dimension
Star vs snowflake
SCD types
Partitioning
User growth modeling

Evening

Solve 15 SQL problems
Explain 5 modeling designs aloud

Day 2 — Coding + Python Basics

Morning

Arrays
Strings
Hash maps
Stack / queue
Big-O

Afternoon

Linked list
Trees
Graphs
BFS / DFS
Sliding window / two pointers

Evening

Solve 10 coding problems
Rewrite in Python without looking

Day 3 — Architecture + Behavioral + Mock Interviews

Morning

Batch vs streaming
Kafka
Data lake / lakehouse
Trino / Spark / Iceberg
Governance / lineage / metadata

Afternoon

System design drills
Data pipeline design
API ingestion design
CDC design

Evening

Behavioral answers
2 full mock rounds
Rapid-fire review

Master Question Bank With Answer Patterns

The 500 questions are grouped into patterns. Master the pattern once, then you can answer 10–20 related questions.

Section 1 — SQL (1–150)

A. Window Functions (1–25)

What is a window function?

Short answer: A window function calculates a value across related rows without collapsing the result set.

Long answer: Unlike GROUP BY, which reduces rows, a window function keeps each row and adds analytical context such as rank, running total, moving average, or previous/next row comparison. It usually uses OVER(), optionally with PARTITION BY and ORDER BY.

Exercise: Take a sales table with columns (employee_id, sale_date, amount). Write a query that shows the running total by employee.

SELECT
  employee_id,
  sale_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY employee_id
    ORDER BY sale_date
  ) AS running_total
FROM sales;

Understand it:

PARTITION BY resets calculation per employee
ORDER BY defines sequence
rows stay visible

Difference between GROUP BY and window function

Short answer: GROUP BY collapses rows; window functions do not.

Long answer: Use GROUP BY when you need one row per group. Use window functions when you need row-level detail plus aggregated context.

Exercise: Compare these two queries on the same table and explain why one returns fewer rows.

ROW_NUMBER vs RANK vs DENSE_RANK

Short answer: ROW_NUMBER always unique; RANK leaves gaps on ties; DENSE_RANK does not.

Long answer:

ROW_NUMBER: best for deduplication
RANK: best when positional gaps are acceptable
DENSE_RANK: best when tied values should share the same rank without skipping the next rank

Exercise: For salary values 100, 100, 90, write all three rankings and explain the output.

Running total

Short answer: Use SUM(…) OVER (ORDER BY …).

Exercise:

SELECT order_date,
       revenue,
       SUM(revenue) OVER (ORDER BY order_date) AS running_revenue
FROM daily_revenue;

Moving average

Short answer: Use a frame clause such as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.

Exercise:

SELECT order_date,
       AVG(revenue) OVER (
         ORDER BY order_date
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg_3
FROM daily_revenue;

Understand it: This is a 3-row moving average, not always 3 calendar days.

LEAD vs LAG

Short answer: LAG looks backward; LEAD looks forward.

Exercise: Find time between two consecutive user events.

SELECT
  user_id,
  event_time,
  event_time - LAG(event_time) OVER (
    PARTITION BY user_id ORDER BY event_time
  ) AS gap_from_previous
FROM events;

FIRST_VALUE and LAST_VALUE

Short answer: Return first or last value in the window.

Important trap: LAST_VALUE often needs a frame clause, otherwise it may not behave as expected.

NTILE

Short answer: Splits ordered rows into buckets.

Use case: quartiles, deciles, customer segmentation.

PERCENT_RANK

Short answer: Gives relative rank from 0 to 1.

CUME_DIST

Short answer: Cumulative distribution showing fraction of rows less than or equal to current row.

What is PARTITION BY?

Short answer: It defines groups for the window function.

What is ORDER BY in a window?

Short answer: It defines processing sequence inside each partition.

What is a frame clause?

Short answer: It narrows which rows in the ordered partition are included.

Deduplicate records using ROW_NUMBER

Short answer: Assign row numbers per business key, keep rn = 1.

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM customer_records
)
SELECT *
FROM ranked
WHERE rn = 1;

Understand it: This keeps the latest record per customer.

Top N per group

Short answer: Use ROW_NUMBER() or DENSE_RANK() with PARTITION BY.

Detect gaps in dates or IDs

Short answer: Compare current row with previous row using LAG.

Sessionization

Short answer: Split user activity into sessions based on inactivity threshold.

Long answer: Use LAG to compare current event time with previous event time. If the gap exceeds a threshold like 30 minutes, mark it as a new session. Then use cumulative sum to generate session IDs.

WITH tagged AS (
  SELECT *,
         CASE
           WHEN event_time - LAG(event_time) OVER (
             PARTITION BY user_id ORDER BY event_time
           ) > INTERVAL '30' MINUTE
           OR LAG(event_time) OVER (
             PARTITION BY user_id ORDER BY event_time
           ) IS NULL
           THEN 1 ELSE 0
         END AS new_session
  FROM events
), sessions AS (
  SELECT *,
         SUM(new_session) OVER (
           PARTITION BY user_id ORDER BY event_time
         ) AS session_id
  FROM tagged
)
SELECT * FROM sessions;

Exercise: Change threshold from 30 minutes to 15 minutes and explain what changes.

18–25. Practice prompts

Rank employees within department by salary
Find first purchase date per customer
Find repeat purchase gap
Rolling 7-day revenue
Top 3 products per category
Detect change in subscription plan
Compare current and previous order amount
Bucket customers into revenue quartiles

For each one, answer with:

the window function needed
partition column
order column
output meaning

B. Joins (26–50)

Sample tables used in the examples below

customers

customer_id	customer_name	region
1	Alice	EU
2	Bob	US
3	Carol	EU
4	Dan	APAC

orders

order_id	customer_id	region	amount
101	1	EU	120
102	1	EU	80
103	2	US	75
104	5	US	60

employees

employee_id	employee_name	manager_id
10	Eva	null
11	Frank	10
12	Grace	10

What is an INNER JOIN?

Short answer: Returns only matching rows from both tables.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.amount
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id;

Result:

customer_id	customer_name	order_id	amount
1	Alice	101	120
1	Alice	102	80
2	Bob	103	75

What is a LEFT JOIN?

Short answer: Returns all rows from the left table and matching rows from the right.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;

Result:

customer_id	customer_name	order_id
1	Alice	101
1	Alice	102
2	Bob	103
3	Carol	null
4	Dan	null

Difference between LEFT JOIN and INNER JOIN

Short answer: LEFT keeps unmatched left-side rows; INNER removes them.

RIGHT JOIN

Short answer: Same logic as LEFT JOIN but preserves right table rows.

FULL OUTER JOIN

Short answer: Returns all matching and non-matching rows from both tables.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.amount
FROM customers c
FULL OUTER JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;

Result:

customer_id	customer_name	order_id	amount
1	Alice	101	120
1	Alice	102	80
2	Bob	103	75
3	Carol	null	null
4	Dan	null	null
null	null	104	60

Understand it: this keeps unmatched rows from both sides.

CROSS JOIN

Short answer: Cartesian product of both tables.

Sample tables for cross join

colors

color
Red
Blue

sizes

size
S
M
L

Example:

SELECT
  c.color,
  s.size
FROM colors c
CROSS JOIN sizes s
ORDER BY c.color, s.size;

Result:

color	size
Blue	L
Blue	M
Blue	S
Red	L
Red	M
Red	S

Understand it: every row in colors pairs with every row in sizes, so 2 x 3 = 6 rows.

SELF JOIN

Short answer: A table joined to itself, often for hierarchy or comparisons.

Example:

SELECT
  e.employee_name,
  m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.employee_id;

Result:

employee_name	manager_name
Eva	null
Frank	Eva
Grace	Eva

Understand it: the same employees table plays two roles here:

e is the employee
m is that employee’s manager

Anti join

Short answer: Find rows in A not in B.

SELECT c.*
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

Result: Carol and Dan

Semi join

Short answer: Return rows in A that have a match in B, without bringing B columns.

SELECT
  c.customer_id,
  c.customer_name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE c.customer_id = o.customer_id
);

Result:

customer_id	customer_name
1	Alice
2	Bob

Understand it: unlike a regular join, this only answers “does a match exist?” and does not return columns from orders.

Anti join vs LEFT JOIN

Short answer: LEFT JOIN keeps all rows from the left table; ANTI JOIN keeps only left rows that do not match.

Example LEFT JOIN:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;

LEFT JOIN result:

customer_id	customer_name	order_id
1	Alice	101
1	Alice	102
2	Bob	103
3	Carol	null
4	Dan	null

Example ANTI JOIN pattern:

SELECT
  c.customer_id,
  c.customer_name
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

ANTI JOIN result:

customer_id	customer_name
3	Carol
4	Dan

Understand it: anti join is usually written using LEFT JOIN ... WHERE right_key IS NULL or NOT EXISTS (...).

What causes duplicate rows after a join?

Short answer: Non-unique join keys.

Understand it: If one customer has 3 orders and 2 payments, a direct join may create 6 rows.

How do you fix join duplication?

Short answer: Understand cardinality, pre-aggregate, or deduplicate before joining.

What is join skew?

Short answer: Uneven key distribution causes one node or partition to process far more data.

Broadcast join vs shuffle join

Short answer: Broadcast sends small table to all workers; shuffle redistributes both sides by key.

When should you use a broadcast join?

Short answer: When one side is small enough to replicate efficiently.

Joining on multiple keys

Exercise: Write a join on (customer_id, region) and explain why using only one column is dangerous.

Example:

SELECT
  c.customer_id,
  c.customer_name,
  o.order_id
FROM customers c
JOIN orders o
  ON c.customer_id = o.customer_id
 AND c.region = o.region;

Why this matters: if customer_id is reused across regions or the business key is composite, joining on only customer_id can create incorrect matches.

Non-equi join

Short answer: Join condition uses <, >, BETWEEN, or other non-equality conditions.

Mini example: join orders to a pricing-band table using amount BETWEEN min_amount AND max_amount.

Join order and performance

Short answer: Filter early, shrink large tables before joining.

Null behavior in joins

Short answer: NULL = NULL is not true, so null keys do not match in standard equality joins.

How do you debug a bad join?

Checklist:

check key uniqueness
count rows before and after
inspect nulls
verify join type
sample exploding keys

Quick debug query:

SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;

If this returns rows, you already know that joining customers to orders can create multiple rows per customer unless that is the intended output grain.

45–50. Practice prompts

Find customers with no orders
Find orders with missing product reference
Join employee with manager using self join
Explain many-to-many join explosion
Compare EXISTS vs IN
Reduce join cost on large datasets

C. Aggregations (51–75)

COUNT(*) vs COUNT(column)

Short answer: COUNT(*) counts all rows; COUNT(column) ignores nulls.

What does GROUP BY do?

Short answer: Groups rows by one or more columns before aggregation.

WHERE vs HAVING

Short answer: WHERE filters rows before grouping; HAVING filters groups after aggregation.

Conditional aggregation

Short answer: Use SUM(CASE WHEN … THEN 1 ELSE 0 END) or similar.

SELECT
  department,
  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS active_count
FROM employees
GROUP BY department;

DISTINCT count

Short answer: COUNT(DISTINCT user_id) counts unique users.

Approx distinct count

Short answer: Used for large-scale systems when exact distinct count is expensive.

ROLLUP

Short answer: Adds subtotal and grand total rows.

CUBE

Short answer: Produces all combinations of subtotals.

GROUPING SETS

Short answer: Lets you explicitly choose multiple grouping combinations.

Median

Short answer: Some engines support percentile_cont(0.5).

Percentile

Short answer: Used to understand distribution, not just average.

Why is AVG sometimes misleading?

Short answer: Outliers can distort it; median may be better.

Nulls in aggregations

Short answer: Most aggregates ignore nulls except COUNT(*).

Aggregation after join trap

Short answer: If a join duplicates rows, your sums and counts become wrong.

65–75. Practice prompts

Daily active users
Revenue by region
Monthly cumulative sales
Ratio of active to total users
95th percentile latency
Distinct customers by month
Product revenue share
Top category by revenue
Countries with declining growth
Average order value
Fraud count by severity

D. CTEs and Subqueries (76–100)

What is a CTE?

Short answer: A named temporary result set used inside a query.

Why use CTEs?

Short answer: Readability, reusability, step-wise logic.

Recursive CTE

Short answer: A CTE that references itself, used for hierarchies or graph-like traversal.

Subquery vs CTE

Short answer: Similar logically; CTE is often easier to read and debug.

Correlated subquery

Short answer: A subquery that depends on values from the outer query.

EXISTS vs IN

Short answer: Both check membership; EXISTS is often clearer and can be better for correlated logic.

Scalar subquery

Short answer: Returns one value, such as a max date.

Common CTE interview pattern

Pattern:

Filter base data
Rank or aggregate
Join results
Final filter

84–100. Practice prompts

Latest record per customer
Org hierarchy using recursive CTE
Orders above customer average
Users with no transactions
Products never sold
Most recent login per user
Monthly cohort extraction
Nested CTE refactor
CTE + window combination
Recursive tree traversal
Parent-child path building
Subquery to CTE rewrite
Use EXISTS for filtering
Find max salary per department
Find rows above department average
Query decomposition strategy
Explain readability vs performance trade-off

E. Advanced SQL (101–150)

For these, use the answer pattern below.

Short answer template: Define it in one sentence.
Long answer template: Explain use case, benefit, and one limitation.
Hands-on template: Write one query or pseudo-query.

Deduplication strategies
CDC query logic
Snapshot vs incremental query
Merge / upsert pattern
Time travel query
JSON parsing in SQL
Arrays in SQL
Struct / nested fields
Regex filtering
Data masking
Row-level security
Column masking
Predicate pushdown
Partition pruning
File skipping
Explain plan basics
Cost-based optimization
Materialized views
Query caching
Resource groups
Query federation
Trino pushdown behavior
Snowflake clustering basics
BigQuery partitioning basics
Iceberg metadata tables
Delta transaction log idea
Null-safe comparisons
Merge late-arriving data
Validate counts after load
Build audit queries
Find broken foreign keys
Data quality assertions
Window performance concerns
Join performance concerns
Aggregation skew
Snapshot isolation concept
Query retry strategy
Fault-tolerant SQL pipelines
Access filters
Secure data views
Multi-tenant filtering
Dynamic SQL risks
SQL injection basics
Query observability
Debug wrong result sets
Debug row explosion
Debug null issues
Query anti-patterns
SQL interview best practices
How to talk through a SQL solution live

Section 2 — Coding (151–300)

Use this answer format for all coding problems:

brute-force idea
optimized idea
time complexity
space complexity
edge cases
clean code

A. Arrays and Strings (151–180)

Reverse a string

Short answer: Use slicing in Python or two pointers.

def reverse_string(s: str) -> str:
    return s[::-1]

Palindrome check

def is_palindrome(s: str) -> bool:
    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]

Two Sum

Short answer: Use a hash map for O(n).

def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        need = target - n
        if need in seen:
            return [seen[need], i]
        seen[n] = i
    return []

Understand it: store past values, not future values.

Three Sum

Short answer: Sort + two pointers.

Longest substring without repeating characters

Short answer: Sliding window + set/map.

Maximum subarray

Short answer: Kadane’s algorithm.

Merge sorted arrays
Remove duplicates from sorted array
Rotate array
Prefix sum
Difference array idea
Group anagrams
Valid anagram
String compression
Character frequency map
Substring search
Pattern matching
Sliding window max
Longest common prefix
Roman to integer
Integer to Roman
String to integer
Reverse words in sentence
Product of array except self
Find missing number
Best time to buy and sell stock
Majority element
Merge intervals
Insert interval
Meeting rooms overlap

For each one, do:

write brute force
optimize
code it
say time/space

B. Stack, Queue, Linked List (181–210)

Valid parentheses

def is_valid(s: str) -> bool:
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in pairs.values():
            stack.append(ch)
        else:
            if not stack or stack[-1] != pairs.get(ch):
                return False
            stack.pop()
    return not stack

Min stack
Queue using two stacks
Stack using queues
Reverse linked list

class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def reverse_list(head):
    prev = None
    curr = head
    while curr:
        nxt = curr.next
        curr.next = prev
        prev = curr
        curr = nxt
    return prev

Detect cycle in linked list
Merge two sorted linked lists
Find middle of linked list
Remove nth node from end
Reorder list
LRU cache concept
Circular queue
Deque use cases
Monotonic stack
Next greater element
Daily temperatures
Largest rectangle in histogram
Trapping rain water
Evaluate reverse polish notation
Sliding window with deque

201–210. Practice repeats and variations

C. Trees and Graphs (211–260)

Binary tree traversal

Short answer:

preorder: root left right
inorder: left root right
postorder: left right root

BFS vs DFS

Short answer: BFS uses queue, DFS uses stack/recursion.

Maximum depth of binary tree
Invert binary tree
Validate BST
Lowest common ancestor
Level order traversal
Tree diameter
Path sum
Serialize and deserialize tree
Number of islands
Clone graph
Course schedule
Topological sort
Detect cycle in graph
Connected components
Shortest path in unweighted graph
Dijkstra basics
Union find basics
Bipartite graph check

231–260. Practice variants

Hands-on rule: Draw the tree or graph before coding.

D. Recursion, Backtracking, DP, Search (261–300)

Fibonacci recursion vs DP
Climbing stairs
Coin change
Longest common subsequence
Edit distance
Knapsack idea
Subsets
Permutations
Combination sum
N-Queens idea
Binary search
Search in rotated sorted array
Find first and last position
Median of two sorted arrays idea
Kth largest element using heap
Top K frequent elements
Heap basics
Recursion tree thinking
Memoization vs tabulation
State definition in DP

281–300. Mixed practice

Interview speaking tip: Always say:

“Let me start with brute force.”
“Now I’ll optimize.”
“Here’s the complexity.”
“Let’s test edge cases.”

Section 3 — Data Modeling (301–400)

Core Answer Formula

For every modeling question answer in this order:

business goal
grain
dimensions
facts
keys
update pattern
quality checks
performance considerations
Fact vs dimension

Short answer: Facts store measurable events; dimensions store descriptive context.

Star schema

Short answer: Fact table in center with denormalized dimensions around it.

Snowflake schema

Short answer: Dimensions are normalized into multiple related tables.

When to use star vs snowflake

Short answer: Star for analytics simplicity and speed; snowflake for reduced redundancy and stricter modeling.

What is grain?

Short answer: The exact level of detail represented by one row in a fact table.

Example: one row per order line, not one row per order.

Surrogate key vs natural key
SCD Type 1
SCD Type 2
SCD Type 3
Factless fact table
Degenerate dimension
Junk dimension
Conformed dimension
Bridge table
Role-playing dimension
Snapshot table
Accumulating snapshot
Periodic snapshot
Event table
User growth model design

User growth model sample answer:

business goal: track new, active, retained users over time
grain: one row per user per day in activity fact or one row per signup event
facts: login_count, active_flag, session_count
dimensions: date, user, geography, device
metrics: DAU, WAU, MAU, cumulative users

321–340. Practice prompts

Design ecommerce warehouse
Design payments data model
Design fraud event model
Design vulnerability scoring model
Design customer 360 model
Design support ticket analytics
Design employee attrition analytics
Design subscription analytics
Design IoT telemetry model
Design clickstream model
Design ad impressions model
Design experiment metrics model
Design inventory analytics
Design order fulfillment analytics
Design API usage analytics
Design model for late-arriving data
Design historical pricing model
Design SCD for customer address
Design audit trail schema
Design data quality monitoring model

341–360. Modeling concepts

Normalization
Denormalization
3NF
OLTP vs OLAP
Partition strategy
Clustering strategy
Data skew considerations
Null handling strategy
Schema evolution
Data versioning
Data retention
Archival strategy
GDPR deletion considerations
Multi-tenant design
Hierarchical data modeling
Graph modeling basics
Semi-structured data modeling
JSON column trade-offs
Wide vs narrow tables
Aggregated tables

361–400. Advanced practice prompts

Lakehouse table design
Iceberg partition evolution idea
Hudi vs Delta vs Iceberg discussion
Streaming event model
Batch snapshot model
Slowly changing relationships
Data vault basics
Domain-driven modeling
KPI modeling
Metric consistency layer
Metadata model
Data catalog entity model
Ownership model
Access control model
Lineage model
Compliance model
Data contract model
Feature store model
Time-series partitioning
Daily cohort model
Retention cohort model
Marketing attribution model
Session fact design
Revenue recognition model
Refund adjustment model
Currency conversion model
Multi-region reporting model
Slowly changing facts
Event sourcing model
CQRS idea in analytics
Cross-domain metric reconciliation
Data mart integration
Semantic layer concept
Snapshot reconciliation
Late-arriving dimensions
Missing dimension handling
Unknown member row
Historical restatement handling
Modeling trade-offs
How to explain a data model in interview

Section 4 — Data Architecture (401–500)

Core Answer Formula

For architecture questions use this structure:

requirements
scale
latency
data sources
ingestion
storage
processing
serving
governance
observability
failure handling
trade-offs
Batch vs streaming

Short answer: Batch processes data in chunks; streaming processes continuously with low latency.

Long answer: Choose batch when cost and simplicity matter more than immediacy. Choose streaming when decisions must be made in near real time, such as fraud or threat detection.

Kafka basics
Topic, partition, offset
Producer vs consumer
Consumer group
Exactly-once concept
At-least-once vs at-most-once
Backpressure
Dead-letter queue
Idempotency
CDC pipeline design
API ingestion design
Batch pipeline design
Real-time alerting pipeline
Lambda vs Kappa
Spark architecture basics
Trino architecture basics
Iceberg basics
Delta basics
Lakehouse concept

421–440. Governance and scale

Data lake design
Data warehouse design
Metadata layer
OpenMetadata concept
Data lineage
Data catalog
Data contracts
RBAC
ABAC
OPA policy engine concept
Row-level security
Column-level security
Encryption at rest and in transit
Masking and tokenization
Audit logging
Monitoring and observability
SLA vs SLO
Reliability design
Fault tolerance
High availability

441–460. Platform design practice prompts

Design a real-time fraud platform
Design a vulnerability intelligence pipeline
Design a clickstream analytics platform
Design a customer 360 lakehouse
Design a federated query layer
Design a metadata-driven governance system
Design a CDC ingestion system
Design an API-to-lake ingestion system
Design a cost-optimized analytics platform
Design multi-tenant data platform
Design low-latency dashboard backend
Design event replay system
Design schema evolution strategy
Design data quality framework
Design lineage and audit system
Design cross-region data platform
Design hybrid cloud data stack
Design role-based access at query layer
Design row filters for sensitive data
Design platform for 1000+ users and 200k queries/day

461–480. Infra and operations

Kubernetes basics for data platform
Autoscaling
Resource isolation
Multi-cluster strategy
Deployment strategies
Blue-green deployment
Canary deployment
CI/CD for data platform
Secrets management
Disaster recovery
Backup and restore
Chaos testing idea
Performance tuning
Throughput vs latency
Caching strategies
Materialized views
Query acceleration
File compaction
Small files problem
Partition evolution

481–500. Executive-level trade-off questions

Batch vs stream trade-offs
Spark vs Trino trade-offs
Iceberg vs Delta trade-offs
Open table format benefits
Federation vs ingestion trade-offs
Build vs buy for metadata platform
Lakehouse vs warehouse trade-offs
Pushdown vs centralized compute
Governance vs developer agility
Cost vs latency
Reliability vs speed of delivery
Multi-tenant isolation trade-offs
Real-time vs simplicity
API integration challenges
Schema drift handling
Platform observability priorities
Security-first architecture
Migration from legacy Hadoop to modern stack
How to present architecture in interview
How to answer follow-up challenges confidently

Rapid Hands-On Exercise Plan

SQL Drills

Create 5 small tables in any SQL environment.
Practice 3 joins, 3 windows, 3 aggregations daily.
For every query, explain output before running it.
After running it, validate row count and logic.

Coding Drills

Solve 10 core problems only, repeatedly.
Write in Python from memory.
For each problem say:
- brute force
- optimized
- complexity
- edge cases
Recode without looking.

Modeling Drills

Pick one business case.
Define grain first.
Draw fact and dimensions on paper.
Explain late-arriving data and history.
Explain quality checks.

Architecture Drills

State requirements first.
Draw ingestion, storage, compute, serving.
Add governance and monitoring.
Add failure scenarios.
Add trade-offs.

Final 3-Day Prioritization

If you cannot finish all 500, do these first:

Must-Win SQL

windows
joins
aggregation
CTE
dedup
sessionization

Must-Win Coding

arrays
strings
hashmap
stack
queue
linked list
tree traversal
BFS/DFS
sliding window
binary search

Must-Win Modeling

grain
fact vs dimension
star vs snowflake
SCD2
snapshots
partitioning

Must-Win Architecture

batch vs stream
Kafka basics
lakehouse
Trino/Spark roles
governance
API ingestion
CDC
fault tolerance

Interview Speaking Rules

Start simple.
State assumptions.
Give brute force first for coding.
Give trade-offs for architecture.
Use one real example from your work whenever possible.
Do not jump into code before explaining.
Validate with a small test case.
End with complexity or trade-off summary.

Personal Focus For You

Because you said Java is stronger and Python is weaker:

Think solution in Java if needed
Speak algorithm in plain English
Code in simple Python
Avoid clever Python tricks
Use readable loops, dicts, lists, sets

Example interview line:

I usually think in Java-style data structures, but I’ll code this in simple Python for speed and readability.

That is completely acceptable.

Data Engineering Interview Crash Workbook

Table of Contents

How To Use This Workbook

3-Day Study Plan

Day 1 — SQL + Data Modeling

Morning

Afternoon

Evening

Day 2 — Coding + Python Basics

Morning

Afternoon

Evening

Day 3 — Architecture + Behavioral + Mock Interviews

Morning

Afternoon

Evening

Master Question Bank With Answer Patterns

Section 1 — SQL (1–150)

A. Window Functions (1–25)

B. Joins (26–50)

C. Aggregations (51–75)

D. CTEs and Subqueries (76–100)

E. Advanced SQL (101–150)

Section 2 — Coding (151–300)

A. Arrays and Strings (151–180)

B. Stack, Queue, Linked List (181–210)

C. Trees and Graphs (211–260)

D. Recursion, Backtracking, DP, Search (261–300)

Section 3 — Data Modeling (301–400)

Core Answer Formula

Section 4 — Data Architecture (401–500)

Core Answer Formula

Rapid Hands-On Exercise Plan

SQL Drills

Coding Drills

Modeling Drills

Architecture Drills

Final 3-Day Prioritization

Must-Win SQL

Must-Win Coding

Must-Win Modeling

Must-Win Architecture

Interview Speaking Rules

Personal Focus For You