
Real-Time Data Platforms Are A Moat. Most Companies Treat Them Like A Cost Center.

Real-time data is the foundation, not a feature. Companies that get it right compound advantage. The rest run on stale signal and never figure out why.

Most companies that say they have a real-time data platform have a near-real-time data platform with a marketing problem. The events land in Kafka in two seconds, sit in a queue for thirty seconds, get processed by a job that runs every five minutes, and end up in a warehouse whose dashboards refresh hourly. That is not real-time. That is batch with extra steps.

The companies that get real-time right compound advantage every quarter. They make better decisions faster, they catch issues before customers do, they personalize at the moment of intent, and they bid in markets where the next-best competitor is operating on five-minute-old data. The rest run on stale signal and wonder why their decisioning gets out-competed.

Real-time data infrastructure is one of the highest-leverage moats a company can build, and one of the most under-invested. Here is how I think about it.

The latency hierarchy

Not all "real-time" is the same. There are three tiers, and each unlocks different decisions. Knowing which tier you actually need keeps you from over-engineering or under-engineering.

Batch is anything over five minutes from event to action. Hourly dashboards. Overnight ETL. Daily personalization batches. This is fine for retrospective analysis, executive reporting, and anything where the decision can wait. Most "data warehouses" live here. The cost is low, the engineering is well-understood, and ninety percent of analytical workloads do not need anything faster.

Micro-batch is roughly thirty seconds to five minutes. Streaming aggregations on five-minute windows. Near-real-time alerting. Same-session personalization where you can tolerate a small delay. This is where most "real-time" stacks actually live, and for many use cases — fraud triage, ops alerting, content recommendations — it is genuinely sufficient.

True streaming is sub-second, ideally sub-200ms end-to-end from event capture to action. Real-time bidding. Voice systems. Fraud blocking at point-of-sale. In-session ML inference. Algorithmic trading-adjacent workflows. This is expensive, hard, and only worth doing if the decision is genuinely time-sensitive. But if you need it, micro-batch is not a substitute. The decisions you can make at 200ms are categorically different from the decisions you can make at 30 seconds.

Why most "real-time" stacks aren't

The most common failure pattern: a team installs Kafka, writes some Flink jobs, points a few dashboards at materialized views, and declares the platform real-time. The events land fast. Everything downstream is slow.

Three things break the chain. First, the consumer is a job scheduler, not a stream consumer. The data lands in Kafka in 50ms, but the dbt model that aggregates it runs every fifteen minutes. End-to-end latency is gated by the slowest hop, and here the slowest hop is fifteen minutes. Second, the storage layer is a warehouse, not a streaming-first store. Snowflake and BigQuery are extraordinary tools, but they are not real-time: ingestion arrives through load jobs or micro-batched pipes, and a query round trip is rarely under a few seconds at meaningful scale. Third, there is no real-time serving layer. The data exists somewhere fresh, but the application accessing it is hitting a stale cache or a nightly snapshot.

Real-time is end to end. If any single hop is slow, the whole pipeline is slow. The math does not negotiate.
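
To see how little it negotiates, here is the opening pipeline written out as a freshness budget. The hop numbers are the illustrative ones from above, not measurements:

    # Freshness budget for the "real-time" pipeline described earlier.
    hops_seconds = {
        "producer to kafka": 2,            # events land in Kafka in two seconds
        "queue wait": 30,                  # sit in a queue for thirty seconds
        "processing job cadence": 5 * 60,  # job runs every five minutes
        "dashboard refresh": 60 * 60,      # dashboards refresh hourly
    }

    # Worst-case staleness is the sum of the hops, and the slow hops
    # dominate so completely that the fast ones are a rounding error.
    worst_case = sum(hops_seconds.values())
    print(f"worst-case staleness: {worst_case / 60:.1f} minutes")  # ~65.5 minutes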

The five things a real real-time platform must do

I have built versions of this enough times that I have a checklist. If a platform is missing any of these, it will fail in production no matter how clean the architecture diagram looks.

Event capture with strong delivery semantics. Every event has a unique id, a source-of-truth timestamp, and at-least-once delivery into the bus. Producers retry on failure. The bus persists events for at least seven days for replay. Anything weaker than this and you cannot reason about the system.
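
A sketch of the producer side with the confluent-kafka Python client. The broker address and topic name are placeholders, and the seven-day retention is a topic setting (retention.ms), not producer code:

    import json
    import time
    import uuid

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "broker:9092",  # placeholder
        "enable.idempotence": True,   # broker deduplicates retried sends per partition
        "acks": "all",                # leader and replicas must confirm the write
        "retries": 2147483647,        # retry transient failures; idempotence keeps it safe
    })

    def on_delivery(err, msg):
        if err is not None:
            # A real producer routes this to a dead-letter path, not just a log line.
            print(f"delivery failed: {err}")

    event = {
        "event_id": str(uuid.uuid4()),             # unique id
        "event_ts": time.time_ns() // 1_000_000,   # source-of-truth timestamp, epoch ms
        "type": "page_view",
    }
    producer.produce("events", key=event["event_id"],
                     value=json.dumps(event), on_delivery=on_delivery)
    producer.flush()  # block until the bus has acknowledged the event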

Schema evolution that does not break consumers. Producers will change. Fields get added, deprecated, renamed. The schema registry is non-negotiable. Backward and forward compatibility rules are enforced at write time. The system rejects incompatible schemas at the producer, not by quietly losing data downstream.
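
A sketch of enforcing that at write time against a Confluent-style schema registry. The registry URL and subject name are placeholders; the point is that the producer asks whether a proposed schema is compatible before anything deploys:

    import json
    import requests

    REGISTRY = "http://schema-registry:8081"  # placeholder
    SUBJECT = "events-value"

    proposed_schema = {
        "type": "record", "name": "Event",
        "fields": [
            {"name": "event_id", "type": "string"},
            {"name": "event_ts", "type": "long"},
            # A new field with a default is backward compatible. Removing a
            # field without a default would be rejected by the same check.
            {"name": "channel", "type": "string", "default": "web"},
        ],
    }

    resp = requests.post(
        f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        json={"schema": json.dumps(proposed_schema)},
    )
    resp.raise_for_status()
    if not resp.json().get("is_compatible"):
        raise RuntimeError(f"incompatible schema for {SUBJECT}; fix before deploying")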

State that survives. Streaming jobs accumulate state — running counts, joins, sessionization windows. That state has to be checkpointed durably and recoverable on a clean restart. RocksDB-backed state with checkpoints to S3 is the boring, working answer for Flink. Whatever your engine, the question is the same: when this job dies, can it pick up where it left off without losing data or producing duplicates?
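
A minimal PyFlink sketch of that setup. The S3 path is a placeholder, the intervals are illustrative, and "exactly-once" here refers to state, not to side effects in external systems:

    from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
    from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_state_backend(EmbeddedRocksDBStateBackend())  # RocksDB holds state off-heap
    env.enable_checkpointing(10_000)                      # checkpoint every ten seconds

    checkpoint_conf = env.get_checkpoint_config()
    checkpoint_conf.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
    checkpoint_conf.set_min_pause_between_checkpoints(5_000)  # no back-to-back checkpoints
    checkpoint_conf.set_checkpoint_storage_dir("s3://my-bucket/flink-checkpoints")  # durable, off-box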

Replay. The single most important capability. Every meaningful streaming pipeline needs to be replayable from any point. That means events live on the bus long enough, transformations are deterministic, and you can run the entire pipeline from a backfill point and produce the same output. Without replay you cannot fix bugs, you cannot recover from corruption, and you cannot test changes safely. Most platforms ship without replay and discover the cost the first time something goes wrong.
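
A sketch of the replay entry point with confluent-kafka: seek every partition to its offset at the backfill timestamp, then run the same deterministic transform the live pipeline runs. Topic, brokers, timestamp, and the process function are placeholders:

    from confluent_kafka import Consumer, TopicPartition

    REPLAY_FROM_MS = 1_700_000_000_000  # epoch ms of the backfill point (placeholder)

    def process(msg):
        """Hypothetical stand-in for the pipeline's deterministic transform."""

    consumer = Consumer({
        "bootstrap.servers": "broker:9092",
        "group.id": "events-backfill",   # separate group, so live consumers are undisturbed
        "enable.auto.commit": False,
    })

    # Resolve each partition's offset at the backfill timestamp, then start there.
    meta = consumer.list_topics("events", timeout=10)
    at_time = [TopicPartition("events", p, REPLAY_FROM_MS)
               for p in meta.topics["events"].partitions]
    consumer.assign(consumer.offsets_for_times(at_time, timeout=10))

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process(msg)  # deterministic, so replayed output matches the original run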

Observability that is itself real-time. If your monitoring is fed by the same warehouse that takes ten minutes to refresh, you will be the last to know your pipeline is broken. Lag metrics, throughput, error rates, and consumer health need to be observable in seconds. SLOs on freshness are tracked alongside SLOs on uptime.
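
A sketch of measuring consumer lag directly against the broker, so the signal never routes through the slow warehouse. Names are placeholders; in production this loop would emit to a metrics system every few seconds and alert on the freshness SLO:

    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "broker:9092",
        "group.id": "events-processor",   # the group whose health we are observing
        "enable.auto.commit": False,
    })

    meta = consumer.list_topics("events", timeout=10)
    partitions = [TopicPartition("events", p) for p in meta.topics["events"].partitions]

    for tp in consumer.committed(partitions, timeout=10):
        low, high = consumer.get_watermark_offsets(tp, timeout=10)
        # If the group has never committed, count the whole retained log as backlog.
        lag = high - tp.offset if tp.offset >= 0 else high - low
        print(f"partition {tp.partition}: lag={lag}")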

Where the value actually lives

Most teams overinvest in the streaming engine and underinvest in everything around it. Whether you run Kafka, Redpanda, Pulsar, or Kinesis, the choice matters less than people pretend. They all work. They all have rough edges. The differences are real but second-order.

The real engineering work is in the layers above and below. The schema registry and contract testing. The CDC layer that gets data out of operational databases without falling over under load. The materialized views or stream-table duality layer that makes streamed data queryable in milliseconds. The serving layer — typically a fast online store like Redis, ScyllaDB, or DynamoDB — that the application actually hits. The replay tooling. The lineage tracking. The cost monitoring.
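
A sketch of that last hop with Redis as the online store. Key names and the TTL are illustrative:

    import redis

    r = redis.Redis(host="online-store", port=6379, decode_responses=True)

    # Written by the streaming job as events arrive:
    def update_profile(user_id: str, feature: str, value: float) -> None:
        r.hset(f"profile:{user_id}", feature, value)
        r.expire(f"profile:{user_id}", 86_400)  # fall back to the cold path after a day

    # Read on the request path, milliseconds before the decision:
    def get_profile(user_id: str) -> dict:
        return r.hgetall(f"profile:{user_id}")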

I have seen teams spend a year on a Kafka-vs-Pulsar bake-off and ship nothing. I have seen teams pick whichever bus their cloud provider gives them, spend that year building the layers around it, and end up with a platform that runs the company. Pick the engine, then move on.

The streaming engine is rarely the moat. The moat is what you do with the events once they land — the contracts, the joins, the serving layer, the replay tooling, the discipline.

Operating cost discipline

Real-time platforms are expensive if you let them be. The classic cost trap is fanning every event out to every consumer, whether or not that consumer needs it. I have seen platforms run six figures a month on data volume that should cost ten thousand.

A few patterns keep cost sane. Topic partitioning that matches consumer parallelism — over-partitioning multiplies overhead. Retention tuned to actual replay requirements, not "just in case" — most teams retain everything forever and pay for storage that nobody reads. Compression at the producer. Compaction on slowly-changing topics. And, importantly, separating hot streams from cold streams. The events that drive sub-200ms decisions live on a fast tier with short, aggressive retention. The events that drive analytics flow to a slower tier with cheaper storage. One bus, two service levels.
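
A sketch of the two-tier idea with the Kafka admin API. Topic names, partition counts, and retention windows are illustrative:

    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": "broker:9092"})

    hot = NewTopic("decisions.hot", num_partitions=12, replication_factor=3, config={
        "retention.ms": str(6 * 60 * 60 * 1000),  # six hours: enough to replay the hot path
        "compression.type": "lz4",                 # cheap on CPU, real savings on the wire
    })
    cold = NewTopic("events.cold", num_partitions=6, replication_factor=3, config={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # seven days: matches the replay window
        "compression.type": "zstd",                    # better ratio; analytics can afford the CPU
        # For slowly-changing keyed topics, use "cleanup.policy": "compact" instead.
    })

    for topic, future in admin.create_topics([hot, cold]).items():
        future.result()  # raises if the broker rejected the config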

I budget real-time infrastructure as a percentage of revenue. For an operationally serious business, two to four percent of revenue on the entire data platform is reasonable. If you are above that and the platform is not directly driving decisions, something is mis-sized.

What real-time unlocks

Done right, a real-time platform changes what decisions are even on the table. In-session personalization that responds to behavior in milliseconds, not next-session. Fraud blocking before the transaction completes, not detection after. Real-time bidding decisions on auction-level signal. Operational alerts that fire before customers notice the issue. Closed-loop ML systems where the model retrains on the data it produced an hour ago, not last week.

None of those are dashboard improvements. They are product capabilities that competitors on a five-minute pipeline cannot copy without rebuilding their foundation. That is what makes real-time a moat.

Treat it like infrastructure, not a feature. Staff it. Budget it. Run SLOs on freshness with the same seriousness you run SLOs on uptime. Build the layers around the engine, not just the engine. And replay everything, always.

The teams that do this end up with a platform that pays for itself in product surface area. The teams that don't end up with a Kafka cluster nobody trusts and a stack of dashboards that lie to them six hours a day.

How to know you actually need it

I get asked all the time whether a given company should invest in real-time. Most of the time the honest answer is not yet. Real-time has a real fixed cost — engineering, operations, on-call burden. If your decisions are not actually time-sensitive, you are paying for capability you will never use, and the maintenance will quietly degrade until you are running a real-time platform that produces near-real-time results, which is the worst of both worlds.

Three tests tell you the truth. The first is whether the cost of acting one minute late is materially different from the cost of acting one second late. If the answer is no, you do not need streaming. The second is whether you have a closed loop — does an action you take now generate signal that improves the next action within the same session? If you have a loop, freshness compounds. If you don't, freshness is vanity. The third is whether the volume justifies the engineering. Streaming makes sense above roughly a hundred thousand events per minute. Below that, a tight micro-batch on a five-second cadence is usually faster to ship, cheaper to run, and indistinguishable from streaming for most decisioning.

If two of those three are true, build it properly and treat it as a strategic asset. If only one is true, run micro-batch and revisit when the workload grows. The wrong investment for the wrong workload is the most common reason real-time platforms get reputations as money pits.
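
For what it is worth, the whole framework fits in a few lines. The threshold below just mirrors the number above; these are heuristics, not laws:

    def real_time_verdict(minute_late_costs_more: bool,
                          has_closed_loop: bool,
                          events_per_minute: int) -> str:
        tests_passed = sum([
            minute_late_costs_more,          # test 1: latency sensitivity
            has_closed_loop,                 # test 2: freshness compounds
            events_per_minute >= 100_000,    # test 3: volume justifies the engineering
        ])
        if tests_passed >= 2:
            return "build streaming properly; treat it as a strategic asset"
        if tests_passed == 1:
            return "run micro-batch; revisit when the workload grows"
        return "batch is fine; spend the engineering elsewhere"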

Ajit Samuel is a New York City based founder and operator. He architects, ships, and operates production AI, agentic systems, real-time data platforms, advertising technology, and growth infrastructure. ajitsamuel.com.