Perfect — this is gold-level insight into how Salesforce actually interviews. Now I’ll convert this into a real preparation playbook tailored to YOU, with:

✅ Exact deep technical questions they will ask
✅ Best possible answers (staff-level)
✅ Follow-up traps (what caused rejection for others)
✅ Focus on low-level data engineering + system design + SQL + pipelines

🚨 What Went Wrong for Others (IMPORTANT)

From what you shared, candidates got rejected because:

❌ “Better candidates” → means:
•	deeper system thinking
•	stronger SQL reasoning
•	more hands-on clarity

👉 NOT because of:
•	lack of experience
•	lack of tools

🎯 What Salesforce REALLY Tests

Across all rounds:

| Area | Depth |
| --- | --- |
| SQL | 🔥🔥🔥 Very Deep |
| Python | Medium |
| Data Pipeline | 🔥🔥🔥 |
| System Design | 🔥🔥 |
| Data Modeling | 🔥🔥🔥 |
| Problem Solving | 🔥🔥🔥 |

🚀 1. DATA WAREHOUSING / INGESTION (DEEP QUESTIONS)

❓ Q1: How do you extract data from multiple sources into a warehouse?

❌ Weak answer

“Use ETL tools / pipelines”

💥 Strong answer

“I design ingestion based on source type:

•	APIs → incremental pull with watermark
•	DB → CDC (log-based preferred)
	CDC (Change Data Capture) captures only the changes (INSERT, UPDATE, DELETE) from a source instead of reloading the full dataset.
	Why CDC is needed:
	❌ Without CDC: full table load every time — expensive, slow, not real-time
	✅ With CDC: only changes are processed — efficient, near real-time, scalable
•	Files → batch ingestion with validation

Then I standardize ingestion with: • schema contracts • metadata capture • data quality checks

Finally, I load into staging → transform → curated layers.”
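A minimal watermark-based incremental pull looks like this (a sketch only — `fetch_rows` is a hypothetical stand-in for the real API/DB call, and field names are illustrative):

```python
def fetch_rows(source, since):
    # Stand-in for the real API/DB call: return rows updated after `since`.
    return [r for r in source if r["updated_at"] > since]

def incremental_pull(source, watermark):
    rows = fetch_rows(source, watermark)
    if rows:
        # Advance the watermark to the newest updated_at actually seen,
        # so the next run picks up only later changes.
        watermark = max(r["updated_at"] for r in rows)
    return rows, watermark

source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00"},
]
rows, wm = incremental_pull(source, "2024-01-01T12:00:00")
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch dependency-free.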

🔥 Follow-up trap

👉 “What if source system sends duplicate data?”

💥 Answer:

“I always design deduplication using business keys + timestamps. CDC pipelines especially require explicit dedup logic.”
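The dedup logic can be sketched in a few lines — keep the latest record per business key (field names here are illustrative, not a fixed schema):

```python
def dedup_latest(records, key="order_id", ts="updated_at"):
    latest = {}
    for r in records:
        k = r[key]
        # Keep only the row with the newest timestamp for each business key.
        if k not in latest or r[ts] > latest[k][ts]:
            latest[k] = r
    return list(latest.values())
```

The same pattern in SQL is `ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC)` filtered to row 1.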

❓ Q2: Given huge data, how do you ensure only relevant data is loaded?

💥 Strong answer • Filter at source (pushdown) • Incremental loads • Partition-based ingestion

👉 Say:

“Full loads are rarely scalable — incremental and filtered ingestion is critical.”

⚙️ 2. AIRFLOW DAG (VERY COMMON)

❓ Q3: How do you configure an Airflow DAG?

💥 Strong answer

“A DAG defines workflow dependencies. Core components:

•	DAG definition (schedule, retries)
•	Tasks (operators)
•	Dependencies (task order)
•	Retry + failure handling

I also ensure: • idempotent tasks • proper logging • alerting”

🔥 Follow-up trap

👉 “What happens if a task fails?”

💥 Answer: • Retry mechanism • upstream/downstream dependency handling • manual rerun

🔥 Deep question

👉 “How do you prevent duplicate runs?”

💥 Answer:

“Use execution_date + idempotent logic + task-level checks.”
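The idea in plain Python (this is a concept sketch, not the Airflow API — in practice the "completed" state lives in a state table or marker files, not a set):

```python
completed = set()  # in practice: a state table / marker files

def run_once(execution_date, task):
    if execution_date in completed:
        return "skipped"          # duplicate run: no side effects
    task(execution_date)
    completed.add(execution_date)
    return "ran"
```

Keying the check on the logical `execution_date` (not wall-clock time) is what makes reruns and backfills safe.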

🧠 3. DATA MODELING (VERY IMPORTANT)

❓ Q4: How do you design a data model?

💥 Strong answer

“I start from business requirements, then identify:

•	entities
•	relationships
•	grain of data

Then: • normalize for OLTP • denormalize for analytics

For warehouse: • star schema (facts + dimensions) • SCD handling for history”

🔥 Follow-up trap

👉 “When do you NOT use star schema?”

💥 Answer: • highly normalized systems • graph-like relationships • real-time systems

❓ Q5: How do you model historical changes?

💥 Answer: • SCD Type 2 • effective_date + expiry_date (open expiry = current row)
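A minimal SCD Type 2 sketch — close the current row, insert the new version (schema and field names are illustrative; an open `expiry_date` marks the current row):

```python
def scd2_update(history, key, new_attrs, today):
    for row in history:
        if row["key"] == key and row["expiry_date"] is None:
            if row["attrs"] == new_attrs:
                return history          # no change: nothing to do
            row["expiry_date"] = today  # close out the old version
    history.append({"key": key, "attrs": new_attrs,
                    "effective_date": today, "expiry_date": None})
    return history
```

Every historical version stays queryable; point-in-time joins filter on `effective_date <= d < expiry_date`.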

🔄 4. QUERY OPTIMIZATION (VERY IMPORTANT)

❓ Q6: How do you optimize queries?

💥 Strong answer • Partition pruning • Predicate pushdown • Avoid SELECT * • Proper joins • Indexing / metadata pruning
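You can show index usage concretely with `EXPLAIN QUERY PLAN` — here via sqlite3 (plan wording varies by SQLite version, but an indexed equality filter shows the index name):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, user_id INTEGER)")
con.execute("CREATE INDEX idx_user ON events(user_id)")

# The plan's detail text should mention idx_user for this filter,
# instead of a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 7"
).fetchall()
```

In an interview, walking through a real plan (scan vs. seek, join order) lands much better than listing optimization buzzwords.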

🔥 Follow-up trap

👉 “Why is query slow even with indexes?”

💥 Answer: • wrong join order • full scan • data skew

🔗 5. SQL (THEY WILL GO DEEP)

❓ Q7: Difference between joins?

👉 DO NOT give textbook answer

💥 Strong answer • INNER → matching rows • LEFT → all left + matched right • RIGHT → opposite • FULL → all rows

👉 ADD:

“Choice of join impacts performance and result correctness.”

🔥 Follow-up trap

👉 “When does LEFT JOIN behave like INNER JOIN?”

💥 Answer:

“When a filter on right-table columns is placed in the WHERE clause instead of the ON clause — the NULLs from unmatched rows fail the filter, so unmatched left rows are silently dropped.”
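A tiny sqlite3 demo of the trap (table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE emp(id INT, name TEXT);
  CREATE TABLE dept(emp_id INT, dept TEXT);
  INSERT INTO emp VALUES (1,'a'),(2,'b');
  INSERT INTO dept VALUES (1,'eng');
""")

# Filter on the right table in WHERE: NULLs from unmatched rows fail
# the predicate, so the LEFT JOIN behaves like an INNER JOIN.
where_filter = con.execute("""
  SELECT e.id FROM emp e LEFT JOIN dept d ON e.id = d.emp_id
  WHERE d.dept = 'eng'
""").fetchall()

# Same filter moved into ON: unmatched left rows are preserved.
on_filter = con.execute("""
  SELECT e.id FROM emp e LEFT JOIN dept d
    ON e.id = d.emp_id AND d.dept = 'eng'
""").fetchall()
```

`where_filter` keeps only employee 1; `on_filter` keeps both employees.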

❓ Q8: How do you handle duplicates in SQL?

💥 Answer: • ROW_NUMBER() • GROUP BY • DISTINCT (last resort)
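The ROW_NUMBER() pattern, runnable via sqlite3 (illustrative schema — keep the newest row per business key):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE orders(order_id INT, updated_at TEXT, amount INT);
  INSERT INTO orders VALUES (1,'2024-01-01',10),(1,'2024-01-02',15),
                            (2,'2024-01-01',20);
""")

# rn = 1 is the newest row within each order_id partition.
rows = con.execute("""
  SELECT order_id, amount FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM orders
  ) WHERE rn = 1
""").fetchall()
```

Unlike DISTINCT, this lets you pick *which* duplicate survives (latest timestamp), which is what CDC pipelines need.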

❓ Q9: Window functions?

💥 Answer: • ranking • running totals • partition-based calculations
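A running total is the classic demo — here with `SUM() OVER` in sqlite3 (toy data for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE sales(day TEXT, amount INT);
  INSERT INTO sales VALUES ('d1',10),('d2',20),('d3',5);
""")

# Default frame (up to the current row) gives a cumulative sum.
totals = con.execute("""
  SELECT day, SUM(amount) OVER (ORDER BY day) AS running
  FROM sales ORDER BY day
""").fetchall()
# totals → [('d1', 10), ('d2', 30), ('d3', 35)]
```

Adding `PARTITION BY` restarts the total per group — that's the partition-based calculation the answer refers to.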

🐍 6. PYTHON ROUND

❓ Q10: Process large dataset in Python?

💥 Answer: • streaming (generators) • chunk processing • avoid loading in memory
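Chunked streaming with a generator, in a few lines (works on any iterable, including a file handle, without materializing the dataset):

```python
def read_chunks(lines, size=2):
    """Yield fixed-size chunks lazily instead of loading everything."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk   # final partial chunk

chunks = list(read_chunks(range(5), size=2))
```

Memory stays bounded by the chunk size, regardless of total input size — the core argument against loading everything into a DataFrame.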

🔥 Follow-up trap

👉 “Why not pandas?”

💥 Answer:

“Pandas is memory-bound — not suitable for large-scale processing.”

⚠️ 7. REAL SYSTEM DESIGN QUESTIONS

❓ Q11: Design a pipeline for HR data (Recruiting + Employee)

💥 Strong answer • ingestion (API + DB) • CDC • storage (warehouse) • transformation (dbt) • serving (BI + AI)

❓ Q12: How do you handle failure in pipeline?

💥 Answer: • retry • checkpoint • idempotency • DLQ
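A sketch of retry + dead-letter queue (DLQ): after exhausting retries, a bad record is parked for inspection instead of blocking the pipeline (all names here are illustrative):

```python
def process_with_retry(records, handler, max_retries=3):
    dlq = []
    for rec in records:
        for attempt in range(max_retries):
            try:
                handler(rec)
                break                 # success: move to next record
            except Exception:
                if attempt == max_retries - 1:
                    dlq.append(rec)   # retries exhausted → DLQ
    return dlq

def flaky(rec):
    # Demo handler: one record always fails.
    if rec == "bad":
        raise ValueError("cannot process")

dlq = process_with_retry(["ok", "bad", "ok"], flaky)
```

Mention backoff between attempts and idempotent handlers — retries are only safe when reprocessing has no side effects.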

❓ Q13: How do you ensure data quality?

💥 Answer: • validation rules • monitoring • alerts
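A minimal shape for validation rules — each rule is a named predicate, and violations would feed monitoring/alerts in a real pipeline (rules here are examples, not a fixed set):

```python
RULES = [
    ("id_not_null",     lambda r: r.get("id") is not None),
    ("amount_positive", lambda r: r.get("amount", 0) > 0),
]

def validate(rows):
    failures = []
    for i, row in enumerate(rows):
        for name, check in RULES:
            if not check(row):
                failures.append((i, name))  # (row index, failed rule)
    return failures

errs = validate([{"id": 1, "amount": 5},
                 {"id": None, "amount": -1}])
```

Tools like Great Expectations or dbt tests formalize exactly this pattern: declarative rules evaluated on every load.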

🧨 8. HARD QUESTIONS (WHERE PEOPLE FAIL)

❓ Q14: Pipeline succeeded but data is wrong — what do you do?

💥 Answer:

“Success status ≠ correctness. I validate data using quality checks and compare with source.”

❓ Q15: How do you debug data issue?

💥 Answer: • trace lineage • compare source vs target • check transformations

🏆 9. PROBLEM-SOLVING QUESTION

❓ Q16: How do you approach complex problems?

💥 Best answer structure: 1. Break problem 2. Identify constraints 3. Design solution 4. Validate 5. Iterate

🎯 10. DIRECTOR ROUND EXPECTATION

❓ Q17: How do you prioritize work?

💥 Answer: • business impact • urgency • dependencies

❓ Q18: Handling multiple projects?

💥 Answer: • clear prioritization • stakeholder alignment • iterative delivery

🔥 FINAL: WHAT YOU MUST DO TO WIN

💥 Your Advantage

You already have: • Distributed systems • AI + RAG • Metadata-driven architecture • API-first systems

🚨 What you MUST improve

Based on feedback patterns:

👉 SQL depth (VERY IMPORTANT) 👉 Low-level pipeline clarity 👉 Simple explanations (don’t over-architect)

🏆 FINAL KILLER LINE

Use this:

“I focus on building data systems that are not just scalable, but correct, observable, and trusted — because wrong data at scale is more dangerous than slow data.”

🚀 NEXT STEP

Now let’s simulate REAL Salesforce Round 2

👉 I will ask you:

1.	SQL problem
2.	Pipeline failure problem
3.	System design

You answer → I grill you like a real interviewer.

Just say:

👉 “Start round 2”