Can you walk me through one system you’ve built recently that you’re most proud of — and explain the architecture and your role in it?
Sure — one of the systems I’m most proud of is a federated data platform I built to improve vulnerability response across the organization.
The problem I saw was that critical software vulnerabilities were constantly emerging, but the response process was very slow. It could take days to weeks to identify which systems were affected, who owned them, and where that software was running.
The core issue was that the data I needed (asset inventory, vulnerability scan results, and ownership metadata) was spread across multiple siloed systems, and there was no reliable way to correlate it quickly.
Initially, I tried a centralized data lake approach, but that didn't work well. Data became stale, pipelines introduced delays, and users trusted the lake less than the source systems.
So I led the design of a federated data platform using Trino as a SQL-on-anything engine. Instead of moving data, I allowed teams to query source systems directly in near real time.
From an architecture perspective:
- Trino acted as the query engine across multiple sources
- Data remained in systems like vulnerability scanners, asset systems, and operational databases
- I used Iceberg + S3 for curated datasets where persistence was needed
- OpenMetadata was used to manage metadata, lineage, and ownership
- OPA was integrated for policy-based access control at query time
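As a rough illustration of the federation idea (not the actual stack), a single SQL statement can join two independent stores without copying data between them. Here sqlite's `ATTACH` stands in for Trino's catalogs, and all table and field names are hypothetical:

```python
import sqlite3

# One "source system": vulnerability scanner findings
scanner = sqlite3.connect(":memory:")
scanner.execute("CREATE TABLE findings (host TEXT, cve TEXT)")
scanner.execute("INSERT INTO findings VALUES ('web-01', 'CVE-2021-44228')")

# A second, independent store: asset inventory with ownership
# (each ':memory:' ATTACH creates a distinct database)
scanner.execute("ATTACH DATABASE ':memory:' AS cmdb")
scanner.execute("CREATE TABLE cmdb.assets (host TEXT, owner TEXT)")
scanner.execute("INSERT INTO cmdb.assets VALUES ('web-01', 'team-payments')")

# One query across both stores, no pipeline or copy step in between
rows = scanner.execute(
    """SELECT f.host, f.cve, a.owner
       FROM findings f
       JOIN cmdb.assets a ON f.host = a.host"""
).fetchall()
print(rows)  # [('web-01', 'CVE-2021-44228', 'team-payments')]
```

Trino applies the same principle across real catalogs (scanners, CMDBs, operational databases), with each connector translating the query to the source system.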
My role was end-to-end:
- I drove the architectural shift from centralized to federated
- Designed the integration between Trino, metadata, and governance layers
- Worked closely with security, platform, and governance teams
- And delivered it incrementally, starting with high-impact datasets
One of the key challenges was balancing real-time access with governance. Direct access improves speed, but without proper controls, it creates risk. So I implemented metadata-driven policies using OpenMetadata and OPA to enforce access dynamically based on ownership and sensitivity.
The outcome was significant:
- I reduced vulnerability response time from days or weeks to near real time
- Security teams could quickly identify impacted systems and owners
- Trust improved because teams queried source-of-truth systems directly
- And reduced infrastructure cost by avoiding unnecessary data duplication
What I really value about this system is that it transformed the platform from a reporting layer into an actionable intelligence system — enabling faster and more informed security decisions.
Okay, that sounds interesting. But I want to understand your decision-making. Why did you choose a federated approach with Trino instead of fixing the data lake? Most companies invest heavily in centralized platforms — why move away from that?
That's a great question, and I actually did try to improve the centralized approach before moving to federation. At that time, I was using a Cloudera-based platform with Hive and Impala for querying, and Spark jobs for ingestion and transformation. All pipelines were scheduled through Autosys. The challenge wasn't just performance; it was operational rigidity and latency.
Every time I needed new data or a change in logic:
- I had to modify Spark jobs
- Update ingestion pipelines
- Go through scheduling cycles in Autosys
- And wait for the next run
So even small changes could take hours or days to reflect.
Also, the infrastructure itself was constrained — Spark jobs were running on limited cluster capacity, which created bottlenecks during peak times.
For a use case like vulnerability response, this model simply didn’t work because I needed:
- Immediate access to data
- Flexibility to query across systems dynamically
- And the ability to adapt quickly without pipeline changes
That’s where the federated approach came in.
With Trino, I shifted from a pipeline-driven model to a query-driven model:
- No need to build or modify Spark jobs
- No dependency on scheduling tools like Autosys
- Teams could directly query source systems using SQL
So instead of waiting for data to be prepared, I enabled on-demand access. That significantly reduced time-to-insight and removed a lot of operational overhead. I still kept curated datasets in Iceberg for historical use cases, but for real-time decision-making, federation was far more effective.
Ultimately, the decision was about moving from a rigid, pipeline-heavy architecture to a flexible, on-demand access model, which aligned much better with the needs of the business.
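The staleness argument can be sketched in a few lines of Python. This is a toy contrast, with invented data, between a scheduled snapshot and a live read:

```python
# Hypothetical contrast between the two models. A scheduled pipeline
# materializes a snapshot that drifts from reality until the next run;
# a query-driven model reads the live source every time it is asked.
source = {"web-01": "patched", "db-02": "vulnerable"}

# Pipeline-driven: snapshot taken at the last scheduled batch run
snapshot = dict(source)
source["db-02"] = "patched"          # reality changes after that run

def pipeline_answer(host):
    return snapshot[host]            # stale until the next batch cycle

def federated_answer(host):
    return source[host]              # always reflects the source of truth

print(pipeline_answer("db-02"))      # vulnerable  (stale)
print(federated_answer("db-02"))     # patched     (live)
```

In a vulnerability-response setting, that gap between snapshot and source is exactly the hours-to-days of latency the federated model removes.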
Okay, but if anyone can run SQL directly on source systems, how did you control access and ensure sensitive data wasn’t exposed?
That was actually one of the most critical concerns when I moved to a federated model. Direct access to source systems increases speed, but without proper controls, it can introduce serious security risks. So instead of relying on static access models, I implemented a metadata-driven governance approach.
I used OpenMetadata to capture:
- Data ownership
- Sensitivity classifications (e.g., PII, critical systems)
- Lineage and business context
Then I integrated Open Policy Agent (OPA) as a centralized policy engine.
At query time, whenever a user executed a query through Trino:
- The request was evaluated against policies in OPA
- Policies were dynamically applied based on:
  - User identity and role
  - Data sensitivity
  - Ownership and domain context
This allowed me to enforce fine-grained controls like:
- Row-level filtering
- Column masking
- Access restrictions based on context
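A minimal Python sketch of what those controls amount to when a result is returned. All roles, tags, and field names here are invented for illustration; the real enforcement happened in Trino via OPA:

```python
# Hypothetical sketch: apply row-level filtering and column masking to a
# query result based on metadata tags and the caller's attributes.
SENSITIVE = {"pii"}

def filter_and_mask(rows, user, column_tags):
    out = []
    for row in rows:
        # Row-level filtering: non-admins only see their own domain
        if "admin" not in user["roles"] and row["domain"] != user["domain"]:
            continue
        # Column masking: hide values in columns tagged as sensitive
        masked = {
            col: ("****" if column_tags.get(col, set()) & SENSITIVE
                  and "admin" not in user["roles"] else val)
            for col, val in row.items()
        }
        out.append(masked)
    return out

rows = [
    {"host": "web-01", "domain": "payments", "owner_email": "a@corp.example"},
    {"host": "db-02", "domain": "hr", "owner_email": "b@corp.example"},
]
tags = {"owner_email": {"pii"}}
analyst = {"roles": {"analyst"}, "domain": "payments"}
print(filter_and_mask(rows, analyst, tags))
# [{'host': 'web-01', 'domain': 'payments', 'owner_email': '****'}]
```

The key point is that the decision inputs (tags, ownership, domain) come from metadata, not from logic hard-coded per dataset.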
One important design decision I made was to externalize policies from the data platform.
So instead of embedding access logic inside Trino or pipelines:
- Policies were centrally managed
- Version-controlled
- And consistently applied across all data access
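One way to picture externalized policy, as a hedged sketch rather than actual OPA Rego: rules live as plain data that can sit in git, and a small generic evaluator applies them with default-deny semantics:

```python
# Hypothetical policy rules as data (the real system used OPA; these
# rule and effect names are invented). Because rules are data, they can
# be version-controlled and applied uniformly to every access path.
POLICIES = [
    {"role": "security-admin", "dataset": "*", "effect": "allow"},
    {"role": "*", "dataset": "cmdb.assets", "effect": "mask_pii"},
]

def evaluate(role, dataset):
    """First matching rule wins; no match means deny by default."""
    for rule in POLICIES:
        if rule["role"] in (role, "*") and rule["dataset"] in (dataset, "*"):
            return rule["effect"]
    return "deny"

print(evaluate("security-admin", "hr.salaries"))  # allow
print(evaluate("analyst", "cmdb.assets"))         # mask_pii
print(evaluate("analyst", "hr.salaries"))         # deny
```

Changing access behavior then means a reviewed change to the policy data, not a redeploy of Trino or any pipeline.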
I also ensured full auditability — every query and access decision could be traced, which was important for compliance and security teams. So even though I enabled faster, direct access, governance was actually stronger and more flexible than before.
This sounds good, but it also sounds complex. How did you ensure teams actually adopted this model instead of bypassing it and going back to their own systems?
That’s a very real concern — and honestly, adoption was one of the hardest parts of this initiative. I knew that if the platform added friction, teams would bypass it and go back to their own tools. So I focused heavily on making the platform both useful and easy to adopt. I approached this in three ways:
First, I delivered immediate value. I onboarded a few high-impact use cases — especially around critical vulnerability response — and showed that teams could get answers in minutes instead of days. That created strong pull from users.
Second, I made it familiar. Instead of introducing new tools, I allowed teams to use SQL through Trino, which most engineers and analysts were already comfortable with. So the learning curve was minimal.
Third, I built trust through governance and transparency. By integrating metadata and policy enforcement, teams could clearly see:
- What data they could access
- Why certain data was restricted
- And who owned the data
That transparency reduced resistance from both users and governance teams.
I also worked closely with stakeholders across security, data, and platform teams to define clear ownership and SLAs, so everyone knew their responsibilities. And importantly, I didn’t force a big-bang migration. I allowed teams to adopt the platform incrementally, while gradually deprecating older pipelines where it made sense. Over time, as teams experienced faster insights and fewer operational bottlenecks, adoption became organic rather than enforced.
Great. Last question from my side. What would you do differently if you were to build this system again?
That’s a great question — and looking back, there are a couple of things I would do differently.
First, I would invest earlier in the semantic and metadata layer.
I focused initially on enabling access and performance, but as adoption grew, I realized that discoverability and context became equally important. If I were to do it again, I would prioritize building a stronger semantic layer upfront — making it easier for both users and AI-driven systems to understand and navigate the data.
Second, I would formalize the governance model earlier in the journey.
I introduced metadata-driven policies with OPA and OpenMetadata, but doing that earlier would have reduced some of the initial friction with governance teams and accelerated adoption.
Third, I would design more explicitly for AI and agent-based interaction from day one.
The platform naturally evolved into an actionable intelligence layer, but now with the rise of AI agents, I would structure it so that agents can:
- Discover data through metadata
- Understand context
- And safely interact with systems using governed access
So overall, the core architecture was solid, but I would bring forward, earlier in the lifecycle:
- Semantic modeling
- Governance maturity
- And AI-readiness
Because increasingly, the value of a data platform is not just in storing or querying data, but in enabling intelligent, automated decision-making on top of it.