Data Engineering Interview Crash Workbook
3-Day Intensive Prep for Salesforce, Google, Snowflake, Databricks
Table of Contents
- How To Use This Workbook
- 3-Day Study Plan
- Master Question Bank With Answer Patterns
- Section 1: SQL
- Section 2: Coding
- Section 3: Data Modeling
- Section 4: Data Architecture
- Rapid Hands-On Exercise Plan
- Final 3-Day Prioritization
- Interview Speaking Rules
- Personal Focus For You
How To Use This Workbook
You have 3 days, so do not try to memorize 500 disconnected answers. Use this workbook in this order for every topic:
- Understand the idea
- Say the short answer out loud
- Expand into the long answer
- Do the exercise by hand
- Code or query it yourself
- Explain the trade-off
For each topic below, use this template:
- What is it?
- Why do we use it?
- When should we use it?
- What are the trade-offs?
- How would I implement it?
- What can go wrong?
3-Day Study Plan
Day 1 β SQL + Data Modeling
Morning
- Window functions
- GROUP BY / HAVING
- Joins
- CTEs
- Deduplication
Afternoon
- Fact vs dimension
- Star vs snowflake
- SCD types
- Partitioning
- User growth modeling
Evening
- Solve 15 SQL problems
- Explain 5 modeling designs aloud
Day 2 β Coding + Python Basics
Morning
- Arrays
- Strings
- Hash maps
- Stack / queue
- Big-O
Afternoon
- Linked list
- Trees
- Graphs
- BFS / DFS
- Sliding window / two pointers
Evening
- Solve 10 coding problems
- Rewrite in Python without looking
Day 3 β Architecture + Behavioral + Mock Interviews
Morning
- Batch vs streaming
- Kafka
- Data lake / lakehouse
- Trino / Spark / Iceberg
- Governance / lineage / metadata
Afternoon
- System design drills
- Data pipeline design
- API ingestion design
- CDC design
Evening
- Behavioral answers
- 2 full mock rounds
- Rapid-fire review
Master Question Bank With Answer Patterns
The 500 questions are grouped into patterns. Master the pattern once, then you can answer 10β20 related questions.
Section 1 β SQL (1β150)
A. Window Functions (1β25)
- What is a window function?
Short answer: A window function calculates a value across related rows without collapsing the result set.
Long answer: Unlike GROUP BY, which reduces rows, a window function keeps each row and adds analytical context such as rank, running total, moving average, or previous/next row comparison. It usually uses OVER(), optionally with PARTITION BY and ORDER BY.
Exercise: Take a sales table with columns (employee_id, sale_date, amount). Write a query that shows the running total by employee.
SELECT
employee_id,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY employee_id
ORDER BY sale_date
) AS running_total
FROM sales;
Understand it:
- PARTITION BY resets calculation per employee
- ORDER BY defines sequence
- rows stay visible
- Difference between GROUP BY and window function
Short answer: GROUP BY collapses rows; window functions do not.
Long answer: Use GROUP BY when you need one row per group. Use window functions when you need row-level detail plus aggregated context.
Exercise: Compare these two queries on the same table and explain why one returns fewer rows.
- ROW_NUMBER vs RANK vs DENSE_RANK
Short answer: ROW_NUMBER always unique; RANK leaves gaps on ties; DENSE_RANK does not.
Long answer:
- ROW_NUMBER: best for deduplication
- RANK: best when positional gaps are acceptable
- DENSE_RANK: best when tied values should share the same rank without skipping the next rank
Exercise: For salary values 100, 100, 90, write all three rankings and explain the output.
- Running total
Short answer: Use SUM(β¦) OVER (ORDER BY β¦).
Exercise:
SELECT order_date,
revenue,
SUM(revenue) OVER (ORDER BY order_date) AS running_revenue
FROM daily_revenue;
- Moving average
Short answer: Use a frame clause such as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.
Exercise:
SELECT order_date,
AVG(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3
FROM daily_revenue;
Understand it: This is a 3-row moving average, not always 3 calendar days.
- LEAD vs LAG
Short answer: LAG looks backward; LEAD looks forward.
Exercise: Find time between two consecutive user events.
SELECT
user_id,
event_time,
event_time - LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) AS gap_from_previous
FROM events;
- FIRST_VALUE and LAST_VALUE
Short answer: Return first or last value in the window.
Important trap: LAST_VALUE often needs a frame clause, otherwise it may not behave as expected.
- NTILE
Short answer: Splits ordered rows into buckets.
Use case: quartiles, deciles, customer segmentation.
- PERCENT_RANK
Short answer: Gives relative rank from 0 to 1.
- CUME_DIST
Short answer: Cumulative distribution showing fraction of rows less than or equal to current row.
- What is PARTITION BY?
Short answer: It defines groups for the window function.
- What is ORDER BY in a window?
Short answer: It defines processing sequence inside each partition.
- What is a frame clause?
Short answer: It narrows which rows in the ordered partition are included.
- Deduplicate records using ROW_NUMBER
Short answer: Assign row numbers per business key, keep rn = 1.
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC
) AS rn
FROM customer_records
)
SELECT *
FROM ranked
WHERE rn = 1;
Understand it: This keeps the latest record per customer.
- Top N per group
Short answer: Use ROW_NUMBER() or DENSE_RANK() with PARTITION BY.
- Detect gaps in dates or IDs
Short answer: Compare current row with previous row using LAG.
- Sessionization
Short answer: Split user activity into sessions based on inactivity threshold.
Long answer: Use LAG to compare current event time with previous event time. If the gap exceeds a threshold like 30 minutes, mark it as a new session. Then use cumulative sum to generate session IDs.
WITH tagged AS (
SELECT *,
CASE
WHEN event_time - LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) > INTERVAL '30' MINUTE
OR LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) IS NULL
THEN 1 ELSE 0
END AS new_session
FROM events
), sessions AS (
SELECT *,
SUM(new_session) OVER (
PARTITION BY user_id ORDER BY event_time
) AS session_id
FROM tagged
)
SELECT * FROM sessions;
Exercise: Change threshold from 30 minutes to 15 minutes and explain what changes.
18β25. Practice prompts
- Rank employees within department by salary
- Find first purchase date per customer
- Find repeat purchase gap
- Rolling 7-day revenue
- Top 3 products per category
- Detect change in subscription plan
- Compare current and previous order amount
- Bucket customers into revenue quartiles
For each one, answer with:
- the window function needed
- partition column
- order column
- output meaning
B. Joins (26β50)
Sample tables used in the examples below
customers
| customer_id | customer_name | region |
|---|---|---|
| 1 | Alice | EU |
| 2 | Bob | US |
| 3 | Carol | EU |
| 4 | Dan | APAC |
orders
| order_id | customer_id | region | amount |
|---|---|---|---|
| 101 | 1 | EU | 120 |
| 102 | 1 | EU | 80 |
| 103 | 2 | US | 75 |
| 104 | 5 | US | 60 |
employees
| employee_id | employee_name | manager_id |
|---|---|---|
| 10 | Eva | null |
| 11 | Frank | 10 |
| 12 | Grace | 10 |
- What is an INNER JOIN?
Short answer: Returns only matching rows from both tables.
Example:
SELECT
c.customer_id,
c.customer_name,
o.order_id,
o.amount
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;
Result:
| customer_id | customer_name | order_id | amount |
|---|---|---|---|
| 1 | Alice | 101 | 120 |
| 1 | Alice | 102 | 80 |
| 2 | Bob | 103 | 75 |
- What is a LEFT JOIN?
Short answer: Returns all rows from the left table and matching rows from the right.
Example:
SELECT
c.customer_id,
c.customer_name,
o.order_id
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;
Result:
| customer_id | customer_name | order_id |
|---|---|---|
| 1 | Alice | 101 |
| 1 | Alice | 102 |
| 2 | Bob | 103 |
| 3 | Carol | null |
| 4 | Dan | null |
- Difference between LEFT JOIN and INNER JOIN
Short answer: LEFT keeps unmatched left-side rows; INNER removes them.
- RIGHT JOIN
Short answer: Same logic as LEFT JOIN but preserves right table rows.
- FULL OUTER JOIN
Short answer: Returns all matching and non-matching rows from both tables.
Example:
SELECT
c.customer_id,
c.customer_name,
o.order_id,
o.amount
FROM customers c
FULL OUTER JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;
Result:
| customer_id | customer_name | order_id | amount |
|---|---|---|---|
| 1 | Alice | 101 | 120 |
| 1 | Alice | 102 | 80 |
| 2 | Bob | 103 | 75 |
| 3 | Carol | null | null |
| 4 | Dan | null | null |
| null | null | 104 | 60 |
Understand it: this keeps unmatched rows from both sides.
- CROSS JOIN
Short answer: Cartesian product of both tables.
Sample tables for cross join
colors
| color |
|---|
| Red |
| Blue |
sizes
| size |
|---|
| S |
| M |
| L |
Example:
SELECT
c.color,
s.size
FROM colors c
CROSS JOIN sizes s
ORDER BY c.color, s.size;
Result:
| color | size |
|---|---|
| Blue | L |
| Blue | M |
| Blue | S |
| Red | L |
| Red | M |
| Red | S |
Understand it: every row in colors pairs with every row in sizes, so 2 x 3 = 6 rows.
- SELF JOIN
Short answer: A table joined to itself, often for hierarchy or comparisons.
Example:
SELECT
e.employee_name,
m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.employee_id;
Result:
| employee_name | manager_name |
|---|---|
| Eva | null |
| Frank | Eva |
| Grace | Eva |
Understand it: the same employees table plays two roles here:
eis the employeemis that employeeβs manager
- Anti join
Short answer: Find rows in A not in B.
SELECT c.*
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
Result: Carol and Dan
- Semi join
Short answer: Return rows in A that have a match in B, without bringing B columns.
SELECT
c.customer_id,
c.customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE c.customer_id = o.customer_id
);
Result:
| customer_id | customer_name |
|---|---|
| 1 | Alice |
| 2 | Bob |
Understand it: unlike a regular join, this only answers βdoes a match exist?β and does not return columns from orders.
Anti join vs LEFT JOIN
Short answer: LEFT JOIN keeps all rows from the left table; ANTI JOIN keeps only left rows that do not match.
Example LEFT JOIN:
SELECT
c.customer_id,
c.customer_name,
o.order_id
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;
LEFT JOIN result:
| customer_id | customer_name | order_id |
|---|---|---|
| 1 | Alice | 101 |
| 1 | Alice | 102 |
| 2 | Bob | 103 |
| 3 | Carol | null |
| 4 | Dan | null |
Example ANTI JOIN pattern:
SELECT
c.customer_id,
c.customer_name
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
ANTI JOIN result:
| customer_id | customer_name |
|---|---|
| 3 | Carol |
| 4 | Dan |
Understand it: anti join is usually written using LEFT JOIN ... WHERE right_key IS NULL or NOT EXISTS (...).
- What causes duplicate rows after a join?
Short answer: Non-unique join keys.
Understand it: If one customer has 3 orders and 2 payments, a direct join may create 6 rows.
- How do you fix join duplication?
Short answer: Understand cardinality, pre-aggregate, or deduplicate before joining.
- What is join skew?
Short answer: Uneven key distribution causes one node or partition to process far more data.
- Broadcast join vs shuffle join
Short answer: Broadcast sends small table to all workers; shuffle redistributes both sides by key.
- When should you use a broadcast join?
Short answer: When one side is small enough to replicate efficiently.
- Joining on multiple keys
Exercise: Write a join on (customer_id, region) and explain why using only one column is dangerous.
Example:
SELECT
c.customer_id,
c.customer_name,
o.order_id
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
AND c.region = o.region;
Why this matters: if customer_id is reused across regions or the business key is composite, joining on only customer_id can create incorrect matches.
- Non-equi join
Short answer: Join condition uses <, >, BETWEEN, or other non-equality conditions.
Mini example: join orders to a pricing-band table using amount BETWEEN min_amount AND max_amount.
- Join order and performance
Short answer: Filter early, shrink large tables before joining.
- Null behavior in joins
Short answer: NULL = NULL is not true, so null keys do not match in standard equality joins.
- How do you debug a bad join?
Checklist:
- check key uniqueness
- count rows before and after
- inspect nulls
- verify join type
- sample exploding keys
Quick debug query:
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
If this returns rows, you already know that joining customers to orders can create multiple rows per customer unless that is the intended output grain.
45β50. Practice prompts
- Find customers with no orders
- Find orders with missing product reference
- Join employee with manager using self join
- Explain many-to-many join explosion
- Compare EXISTS vs IN
- Reduce join cost on large datasets
C. Aggregations (51β75)
- COUNT(*) vs COUNT(column)
Short answer: COUNT(*) counts all rows; COUNT(column) ignores nulls.
- What does GROUP BY do?
Short answer: Groups rows by one or more columns before aggregation.
- WHERE vs HAVING
Short answer: WHERE filters rows before grouping; HAVING filters groups after aggregation.
- Conditional aggregation
Short answer: Use SUM(CASE WHEN β¦ THEN 1 ELSE 0 END) or similar.
SELECT
department,
SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS active_count
FROM employees
GROUP BY department;
- DISTINCT count
Short answer: COUNT(DISTINCT user_id) counts unique users.
- Approx distinct count
Short answer: Used for large-scale systems when exact distinct count is expensive.
- ROLLUP
Short answer: Adds subtotal and grand total rows.
- CUBE
Short answer: Produces all combinations of subtotals.
- GROUPING SETS
Short answer: Lets you explicitly choose multiple grouping combinations.
- Median
Short answer: Some engines support percentile_cont(0.5).
- Percentile
Short answer: Used to understand distribution, not just average.
- Why is AVG sometimes misleading?
Short answer: Outliers can distort it; median may be better.
- Nulls in aggregations
Short answer: Most aggregates ignore nulls except COUNT(*).
- Aggregation after join trap
Short answer: If a join duplicates rows, your sums and counts become wrong.
65β75. Practice prompts
- Daily active users
- Revenue by region
- Monthly cumulative sales
- Ratio of active to total users
- 95th percentile latency
- Distinct customers by month
- Product revenue share
- Top category by revenue
- Countries with declining growth
- Average order value
- Fraud count by severity
D. CTEs and Subqueries (76β100)
- What is a CTE?
Short answer: A named temporary result set used inside a query.
- Why use CTEs?
Short answer: Readability, reusability, step-wise logic.
- Recursive CTE
Short answer: A CTE that references itself, used for hierarchies or graph-like traversal.
- Subquery vs CTE
Short answer: Similar logically; CTE is often easier to read and debug.
- Correlated subquery
Short answer: A subquery that depends on values from the outer query.
- EXISTS vs IN
Short answer: Both check membership; EXISTS is often clearer and can be better for correlated logic.
- Scalar subquery
Short answer: Returns one value, such as a max date.
- Common CTE interview pattern
Pattern:
- Filter base data
- Rank or aggregate
- Join results
- Final filter
84β100. Practice prompts
- Latest record per customer
- Org hierarchy using recursive CTE
- Orders above customer average
- Users with no transactions
- Products never sold
- Most recent login per user
- Monthly cohort extraction
- Nested CTE refactor
- CTE + window combination
- Recursive tree traversal
- Parent-child path building
- Subquery to CTE rewrite
- Use EXISTS for filtering
- Find max salary per department
- Find rows above department average
- Query decomposition strategy
- Explain readability vs performance trade-off
E. Advanced SQL (101β150)
For these, use the answer pattern below.
Short answer template: Define it in one sentence.
Long answer template: Explain use case, benefit, and one limitation.
Hands-on template: Write one query or pseudo-query.
- Deduplication strategies
- CDC query logic
- Snapshot vs incremental query
- Merge / upsert pattern
- Time travel query
- JSON parsing in SQL
- Arrays in SQL
- Struct / nested fields
- Regex filtering
- Data masking
- Row-level security
- Column masking
- Predicate pushdown
- Partition pruning
- File skipping
- Explain plan basics
- Cost-based optimization
- Materialized views
- Query caching
- Resource groups
- Query federation
- Trino pushdown behavior
- Snowflake clustering basics
- BigQuery partitioning basics
- Iceberg metadata tables
- Delta transaction log idea
- Null-safe comparisons
- Merge late-arriving data
- Validate counts after load
- Build audit queries
- Find broken foreign keys
- Data quality assertions
- Window performance concerns
- Join performance concerns
- Aggregation skew
- Snapshot isolation concept
- Query retry strategy
- Fault-tolerant SQL pipelines
- Access filters
- Secure data views
- Multi-tenant filtering
- Dynamic SQL risks
- SQL injection basics
- Query observability
- Debug wrong result sets
- Debug row explosion
- Debug null issues
- Query anti-patterns
- SQL interview best practices
- How to talk through a SQL solution live
Section 2 β Coding (151β300)
Use this answer format for all coding problems:
- brute-force idea
- optimized idea
- time complexity
- space complexity
- edge cases
- clean code
A. Arrays and Strings (151β180)
- Reverse a string
Short answer: Use slicing in Python or two pointers.
def reverse_string(s: str) -> str:
return s[::-1]
- Palindrome check
def is_palindrome(s: str) -> bool:
cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())
return cleaned == cleaned[::-1]
- Two Sum
Short answer: Use a hash map for O(n).
def two_sum(nums, target):
seen = {}
for i, n in enumerate(nums):
need = target - n
if need in seen:
return [seen[need], i]
seen[n] = i
return []
Understand it: store past values, not future values.
- Three Sum
Short answer: Sort + two pointers.
- Longest substring without repeating characters
Short answer: Sliding window + set/map.
- Maximum subarray
Short answer: Kadaneβs algorithm.
-
Merge sorted arrays
-
Remove duplicates from sorted array
-
Rotate array
-
Prefix sum
-
Difference array idea
-
Group anagrams
-
Valid anagram
-
String compression
-
Character frequency map
-
Substring search
-
Pattern matching
-
Sliding window max
-
Longest common prefix
-
Roman to integer
-
Integer to Roman
-
String to integer
-
Reverse words in sentence
-
Product of array except self
-
Find missing number
-
Best time to buy and sell stock
-
Majority element
-
Merge intervals
-
Insert interval
-
Meeting rooms overlap
For each one, do:
- write brute force
- optimize
- code it
- say time/space
B. Stack, Queue, Linked List (181β210)
- Valid parentheses
def is_valid(s: str) -> bool:
pairs = {')': '(', ']': '[', '}': '{'}
stack = []
for ch in s:
if ch in pairs.values():
stack.append(ch)
else:
if not stack or stack[-1] != pairs.get(ch):
return False
stack.pop()
return not stack
-
Min stack
-
Queue using two stacks
-
Stack using queues
-
Reverse linked list
class ListNode:
def __init__(self, val=0, next=None):
self.val = val
self.next = next
def reverse_list(head):
prev = None
curr = head
while curr:
nxt = curr.next
curr.next = prev
prev = curr
curr = nxt
return prev
-
Detect cycle in linked list
-
Merge two sorted linked lists
-
Find middle of linked list
-
Remove nth node from end
-
Reorder list
-
LRU cache concept
-
Circular queue
-
Deque use cases
-
Monotonic stack
-
Next greater element
-
Daily temperatures
-
Largest rectangle in histogram
-
Trapping rain water
-
Evaluate reverse polish notation
-
Sliding window with deque
201β210. Practice repeats and variations
C. Trees and Graphs (211β260)
- Binary tree traversal
Short answer:
- preorder: root left right
- inorder: left root right
- postorder: left right root
- BFS vs DFS
Short answer: BFS uses queue, DFS uses stack/recursion.
-
Maximum depth of binary tree
-
Invert binary tree
-
Validate BST
-
Lowest common ancestor
-
Level order traversal
-
Tree diameter
-
Path sum
-
Serialize and deserialize tree
-
Number of islands
-
Clone graph
-
Course schedule
-
Topological sort
-
Detect cycle in graph
-
Connected components
-
Shortest path in unweighted graph
-
Dijkstra basics
-
Union find basics
-
Bipartite graph check
231β260. Practice variants
Hands-on rule: Draw the tree or graph before coding.
D. Recursion, Backtracking, DP, Search (261β300)
-
Fibonacci recursion vs DP
-
Climbing stairs
-
Coin change
-
Longest common subsequence
-
Edit distance
-
Knapsack idea
-
Subsets
-
Permutations
-
Combination sum
-
N-Queens idea
-
Binary search
-
Search in rotated sorted array
-
Find first and last position
-
Median of two sorted arrays idea
-
Kth largest element using heap
-
Top K frequent elements
-
Heap basics
-
Recursion tree thinking
-
Memoization vs tabulation
-
State definition in DP
281β300. Mixed practice
Interview speaking tip: Always say:
- βLet me start with brute force.β
- βNow Iβll optimize.β
- βHereβs the complexity.β
- βLetβs test edge cases.β
Section 3 β Data Modeling (301β400)
Core Answer Formula
For every modeling question answer in this order:
- business goal
- grain
- dimensions
- facts
- keys
- update pattern
- quality checks
-
performance considerations
- Fact vs dimension
Short answer: Facts store measurable events; dimensions store descriptive context.
- Star schema
Short answer: Fact table in center with denormalized dimensions around it.
- Snowflake schema
Short answer: Dimensions are normalized into multiple related tables.
- When to use star vs snowflake
Short answer: Star for analytics simplicity and speed; snowflake for reduced redundancy and stricter modeling.
- What is grain?
Short answer: The exact level of detail represented by one row in a fact table.
Example: one row per order line, not one row per order.
-
Surrogate key vs natural key
-
SCD Type 1
-
SCD Type 2
-
SCD Type 3
-
Factless fact table
-
Degenerate dimension
-
Junk dimension
-
Conformed dimension
-
Bridge table
-
Role-playing dimension
-
Snapshot table
-
Accumulating snapshot
-
Periodic snapshot
-
Event table
-
User growth model design
User growth model sample answer:
- business goal: track new, active, retained users over time
- grain: one row per user per day in activity fact or one row per signup event
- facts: login_count, active_flag, session_count
- dimensions: date, user, geography, device
- metrics: DAU, WAU, MAU, cumulative users
321β340. Practice prompts
- Design ecommerce warehouse
- Design payments data model
- Design fraud event model
- Design vulnerability scoring model
- Design customer 360 model
- Design support ticket analytics
- Design employee attrition analytics
- Design subscription analytics
- Design IoT telemetry model
- Design clickstream model
- Design ad impressions model
- Design experiment metrics model
- Design inventory analytics
- Design order fulfillment analytics
- Design API usage analytics
- Design model for late-arriving data
- Design historical pricing model
- Design SCD for customer address
- Design audit trail schema
- Design data quality monitoring model
341β360. Modeling concepts
- Normalization
- Denormalization
- 3NF
- OLTP vs OLAP
- Partition strategy
- Clustering strategy
- Data skew considerations
- Null handling strategy
- Schema evolution
- Data versioning
- Data retention
- Archival strategy
- GDPR deletion considerations
- Multi-tenant design
- Hierarchical data modeling
- Graph modeling basics
- Semi-structured data modeling
- JSON column trade-offs
- Wide vs narrow tables
- Aggregated tables
361β400. Advanced practice prompts
- Lakehouse table design
- Iceberg partition evolution idea
- Hudi vs Delta vs Iceberg discussion
- Streaming event model
- Batch snapshot model
- Slowly changing relationships
- Data vault basics
- Domain-driven modeling
- KPI modeling
- Metric consistency layer
- Metadata model
- Data catalog entity model
- Ownership model
- Access control model
- Lineage model
- Compliance model
- Data contract model
- Feature store model
- Time-series partitioning
- Daily cohort model
- Retention cohort model
- Marketing attribution model
- Session fact design
- Revenue recognition model
- Refund adjustment model
- Currency conversion model
- Multi-region reporting model
- Slowly changing facts
- Event sourcing model
- CQRS idea in analytics
- Cross-domain metric reconciliation
- Data mart integration
- Semantic layer concept
- Snapshot reconciliation
- Late-arriving dimensions
- Missing dimension handling
- Unknown member row
- Historical restatement handling
- Modeling trade-offs
- How to explain a data model in interview
Section 4 β Data Architecture (401β500)
Core Answer Formula
For architecture questions use this structure:
- requirements
- scale
- latency
- data sources
- ingestion
- storage
- processing
- serving
- governance
- observability
- failure handling
-
trade-offs
- Batch vs streaming
Short answer: Batch processes data in chunks; streaming processes continuously with low latency.
Long answer: Choose batch when cost and simplicity matter more than immediacy. Choose streaming when decisions must be made in near real time, such as fraud or threat detection.
-
Kafka basics
-
Topic, partition, offset
-
Producer vs consumer
-
Consumer group
-
Exactly-once concept
-
At-least-once vs at-most-once
-
Backpressure
-
Dead-letter queue
-
Idempotency
-
CDC pipeline design
-
API ingestion design
-
Batch pipeline design
-
Real-time alerting pipeline
-
Lambda vs Kappa
-
Spark architecture basics
-
Trino architecture basics
-
Iceberg basics
-
Delta basics
-
Lakehouse concept
421β440. Governance and scale
- Data lake design
- Data warehouse design
- Metadata layer
- OpenMetadata concept
- Data lineage
- Data catalog
- Data contracts
- RBAC
- ABAC
- OPA policy engine concept
- Row-level security
- Column-level security
- Encryption at rest and in transit
- Masking and tokenization
- Audit logging
- Monitoring and observability
- SLA vs SLO
- Reliability design
- Fault tolerance
- High availability
441β460. Platform design practice prompts
- Design a real-time fraud platform
- Design a vulnerability intelligence pipeline
- Design a clickstream analytics platform
- Design a customer 360 lakehouse
- Design a federated query layer
- Design a metadata-driven governance system
- Design a CDC ingestion system
- Design an API-to-lake ingestion system
- Design a cost-optimized analytics platform
- Design multi-tenant data platform
- Design low-latency dashboard backend
- Design event replay system
- Design schema evolution strategy
- Design data quality framework
- Design lineage and audit system
- Design cross-region data platform
- Design hybrid cloud data stack
- Design role-based access at query layer
- Design row filters for sensitive data
- Design platform for 1000+ users and 200k queries/day
461β480. Infra and operations
- Kubernetes basics for data platform
- Autoscaling
- Resource isolation
- Multi-cluster strategy
- Deployment strategies
- Blue-green deployment
- Canary deployment
- CI/CD for data platform
- Secrets management
- Disaster recovery
- Backup and restore
- Chaos testing idea
- Performance tuning
- Throughput vs latency
- Caching strategies
- Materialized views
- Query acceleration
- File compaction
- Small files problem
- Partition evolution
481β500. Executive-level trade-off questions
- Batch vs stream trade-offs
- Spark vs Trino trade-offs
- Iceberg vs Delta trade-offs
- Open table format benefits
- Federation vs ingestion trade-offs
- Build vs buy for metadata platform
- Lakehouse vs warehouse trade-offs
- Pushdown vs centralized compute
- Governance vs developer agility
- Cost vs latency
- Reliability vs speed of delivery
- Multi-tenant isolation trade-offs
- Real-time vs simplicity
- API integration challenges
- Schema drift handling
- Platform observability priorities
- Security-first architecture
- Migration from legacy Hadoop to modern stack
- How to present architecture in interview
- How to answer follow-up challenges confidently
Rapid Hands-On Exercise Plan
SQL Drills
- Create 5 small tables in any SQL environment.
- Practice 3 joins, 3 windows, 3 aggregations daily.
- For every query, explain output before running it.
- After running it, validate row count and logic.
Coding Drills
- Solve 10 core problems only, repeatedly.
- Write in Python from memory.
- For each problem say:
- brute force
- optimized
- complexity
- edge cases
- Recode without looking.
Modeling Drills
- Pick one business case.
- Define grain first.
- Draw fact and dimensions on paper.
- Explain late-arriving data and history.
- Explain quality checks.
Architecture Drills
- State requirements first.
- Draw ingestion, storage, compute, serving.
- Add governance and monitoring.
- Add failure scenarios.
- Add trade-offs.
Final 3-Day Prioritization
If you cannot finish all 500, do these first:
Must-Win SQL
- windows
- joins
- aggregation
- CTE
- dedup
- sessionization
Must-Win Coding
- arrays
- strings
- hashmap
- stack
- queue
- linked list
- tree traversal
- BFS/DFS
- sliding window
- binary search
Must-Win Modeling
- grain
- fact vs dimension
- star vs snowflake
- SCD2
- snapshots
- partitioning
Must-Win Architecture
- batch vs stream
- Kafka basics
- lakehouse
- Trino/Spark roles
- governance
- API ingestion
- CDC
- fault tolerance
Interview Speaking Rules
- Start simple.
- State assumptions.
- Give brute force first for coding.
- Give trade-offs for architecture.
- Use one real example from your work whenever possible.
- Do not jump into code before explaining.
- Validate with a small test case.
- End with complexity or trade-off summary.
Personal Focus For You
Because you said Java is stronger and Python is weaker:
- Think solution in Java if needed
- Speak algorithm in plain English
- Code in simple Python
- Avoid clever Python tricks
- Use readable loops, dicts, lists, sets
Example interview line:
I usually think in Java-style data structures, but Iβll code this in simple Python for speed and readability.
That is completely acceptable.