Service Mesh in Kubernetes: Istio, Linkerd, or Cilium — When the Business Needs It

A service mesh is not a mandatory next step after Kubernetes, nor is it a universal improvement to add “just in case.” It is needed when internal calls between services become a business risk in their own right: incidents become harder to investigate, releases become harder to roll out safely, access becomes harder to control, and it becomes harder to prove which service interacted with which.

The simplest way to think about a service mesh is through the problems it solves:

  • mTLS: encryption and mutual authentication of services;
  • Traffic splitting: gradual rollout of new versions, canary releases, and quick rollback;
  • Retries and timeouts: unified rules for retries and wait times across services;
  • Observability: a map of interservice calls, errors, latency, and request traces;
  • Zero trust: access is based on service identity, not on “being on the network”;
  • Multi-service architecture: managing dependencies as the number of services and teams grows.

But these capabilities come at a cost. A mesh adds new components, overhead, diagnostic procedures, and responsibilities for the platform team. If there are few services, service relationships are simple, releases are infrequent, and the current capabilities of Kubernetes, Gateway API, NetworkPolicy, Prometheus, and OpenTelemetry are sufficient, a service mesh may be a premature complication.

However, if internal calls affect payments, orders, access permissions, customer SLAs, or audit requirements, a mesh becomes a justified option. In that case, the decision is less about picking the “most powerful” tool and more about choosing an ownership model: Istio for flexibility and complex enterprise scenarios, Linkerd for an easier path into service mesh, and Cilium when networking and security are already built around Cilium/eBPF.

When a service mesh becomes a business concern

Kubernetes is already running, the number of services is growing, teams are releasing independently, and incidents are increasingly hidden not in a single application but in a chain of internal calls. The user sees one thing: an operation has become slow or unstable. The internal team sees something else: the API is waiting on billing, billing is waiting on permissions, permissions depends on storage, and the exact cause of the degradation gets lost between services.

This is usually when the idea comes up: deploy a service mesh and centrally manage the security, routing, resilience, and observability of internal traffic. The idea may be sound, but only if the problem really lies in the interactions between services, rather than in the basic configuration of Kubernetes, monitoring, or the release process.

A mesh is worth considering when internal calls are already affecting business metrics:

Situation | Why a mesh can help
Incidents are difficult to investigate | Visibility into call chains, latency, and errors between services is needed
Releases have become risky | Canary releases, traffic splitting, and fast rollback without manual routing are needed
Security requirements are increasing | mTLS, proxies, service identity-based access, and movement toward zero trust are needed
Retries and timeouts are configured inconsistently | Consistent retry/timeout rules between services are needed
There are now many services and teams | A shared model for managing internal traffic is needed

But the downside is important as well. A service mesh sits in the critical path of requests. If access rules are misconfigured, services may lose access to each other. If retries are too aggressive, a failure can be amplified by a cascade of retries. mTLS can affect performance and network latency. If the team does not know how to diagnose the control plane, data plane, sidecars, or the eBPF layer, investigations will become harder, not easier.

So the practical question is not “which mesh is better,” but “are we ready to own this layer?” Before choosing Istio, Linkerd, or Cilium, you need to answer three questions honestly: which problems are not addressed by the current tools, how much complexity the mesh will add, and who will be responsible for it in the production environment.

The next logical step is to examine exactly what a service mesh adds to Kubernetes, and then distinguish the situations where it is genuinely needed from the cases where it is simpler and safer to do without it.

What a service mesh adds to Kubernetes

Where Kubernetes’ basic capabilities end

Kubernetes handles the infrastructure layer well. It schedules containers, restarts them after failures, provides Services, DNS, internal network connectivity, basic load balancing, and mechanisms for inbound traffic through Ingress or the Gateway API.

But Kubernetes by itself does not act as a dispatcher for every internal HTTP/gRPC call. It can provide connectivity between api and billing, but it typically does not answer more application-level questions:

  • Who exactly is allowed to call this service;
  • Whether the connection needs to be encrypted and mutually authenticated;
  • Which timeout to apply;
  • How many times a request may be retried;
  • How to split traffic between the old and new versions;
  • Where exactly latency increased in the call chain.

This is where a service mesh comes in. It adds a management layer for internal traffic between services. This traffic is often called east-west traffic: these are calls inside the platform that users do not see directly, but they directly affect the speed and reliability of a business operation.

Ingress, API Gateway, and the Gateway API more often operate at the edge of the system, handling traffic from users, partners, or external systems. This traffic pattern is called north-south traffic. A service mesh usually addresses a different problem: not “how a user got into the cluster,” but “how services inside the cluster communicate with each other securely and predictably.”

Which functions a mesh moves into the infrastructure

In practice, a mesh adds several capabilities:

  • mTLS: encryption and mutual authentication of services;
  • Access policies: permissions based on the service identity, not just IP;
  • Traffic splitting: splitting traffic between versions for canary releases;
  • Retries and timeouts: unified rules for retries and waiting;
  • Observability: visibility into call chains, latency, and errors between services.

mTLS matters here not as “yet another layer of encryption.” Its purpose is that both parties verify each other. The client service confirms that it is calling the right service, while the receiving service knows who is calling it. In Kubernetes, this is easier to build around service identity or a service account than around IP addresses, which change constantly.
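A minimal sketch of what identity-based access can look like, assuming the mesh has already verified the caller's identity through mTLS. The policy table, the identity strings (written here as `namespace/service-account`), and the `is_allowed` helper are hypothetical and do not correspond to any particular mesh's API.

```python
# Hypothetical allowlist: destination service -> identities permitted to call it.
# Identities are service accounts, not IP addresses, so the policy
# stays valid when pods are rescheduled and their IPs change.
ALLOWED_CALLERS = {
    "billing": {"prod/api"},
    "permissions": {"prod/api", "prod/billing"},
    "storage": {"prod/permissions"},
}


def is_allowed(caller_identity: str, destination: str) -> bool:
    """Return True if the verified caller identity may call the destination.

    Destinations without an entry are denied (deny-by-default).
    """
    return caller_identity in ALLOWED_CALLERS.get(destination, set())


if __name__ == "__main__":
    print(is_allowed("prod/api", "billing"))            # True
    print(is_allowed("prod/notifications", "storage"))  # False
```

The point of the sketch is the key of the lookup: the decision depends on who the caller is, not on where it happens to be running.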

Consider a B2B SaaS application where a single user operation passes through api, billing, permissions, notifications, and storage. To the customer, the problem looks simple: the page has started loading more slowly. But the cause may be deeper—for example, permissions may be waiting for a response from storage, causing latency to build up in api.

Kubernetes will show that the pods are running, Services are available, and CPU and memory are within normal ranges. But that may not be enough to quickly understand which internal call has degraded and how it affects the entire chain. A service mesh adds exactly this view: who calls whom, where latency is increasing, where errors occur, and which rules are applied to the traffic.
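To make the "where is latency increasing" question concrete, here is a minimal sketch that separates the time each service spends on its own work from the time it spends waiting on its downstream call. The chain, the numbers, and the helper names are hypothetical; in practice a mesh or tracing backend would supply these measurements.

```python
# Hypothetical measurements for one request:
# (service, total duration in ms seen by that service, downstream service it waits on)
CHAIN = [
    ("api", 480, "permissions"),
    ("permissions", 450, "storage"),
    ("storage", 430, None),
]


def self_times(chain):
    """Return each service's own time: its total minus the time spent waiting downstream."""
    totals = {name: total for name, total, _ in chain}
    result = {}
    for name, total, downstream in chain:
        waited = totals[downstream] if downstream else 0
        result[name] = total - waited
    return result


def slowest_service(chain):
    """Return the service that contributes the most own time to the request."""
    times = self_times(chain)
    return max(times, key=times.get)


if __name__ == "__main__":
    print(self_times(CHAIN))       # {'api': 30, 'permissions': 20, 'storage': 430}
    print(slowest_service(CHAIN))  # storage
```

In this made-up example `api` looks slow from the outside (480 ms), but almost all of that time is inherited from `storage`.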

The main benefit of a mesh is that some networking logic no longer has to be implemented differently in every service. Retries, timeouts, tracing, mTLS, and access rules become more consistent. Releases can be carried out more cautiously through traffic splitting, and incident investigations can rely not only on resource metrics but also on a map of interservice dependencies.
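Traffic splitting can be illustrated with a small simulation. This is a sketch of weighted routing in general, not of how any specific mesh implements it; the version names, weights, and function names are assumptions made for the example.

```python
import random


def pick_version(weights, rng):
    """Pick a backend version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights, requests, seed=0):
    """Route `requests` simulated calls and count how many each version received."""
    rng = random.Random(seed)
    counts = {version: 0 for version in weights}
    for _ in range(requests):
        counts[pick_version(weights, rng)] += 1
    return counts


if __name__ == "__main__":
    # Canary: roughly 10% of calls go to the new version.
    print(simulate({"v1": 90, "v2": 10}, requests=10_000))
    # Rollback: setting the canary weight to 0 sends everything back to v1.
    print(simulate({"v1": 100, "v2": 0}, requests=10_000))
```

Rollback in this model is a weight change rather than a redeployment, which is why it can be fast.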

This does not mean that Kubernetes “can’t handle it.” It solves its own problem: container orchestration and the basic networking model. A service mesh becomes useful when communication between services itself becomes a separate area of management.

When a service mesh is truly needed—and when it is not

After reviewing what a service mesh can do, it is important not to equate “we have more services” with “it is time to adopt a mesh.” The decision depends not on the number of services itself, but on the cost of errors in how they interact.

Fifty services with infrequent, simple connections can run perfectly well without a mesh. But ten services in a payment, healthcare, or B2B environment may already require dedicated control over who calls whom, how access is verified, how versions are rolled out, and where to look for latency in the request chain.

The easiest way to assess whether a mesh is needed is to look at several indicators:

Indicator | A mesh is needed if… | A mesh may be unnecessary if…
Internal calls | Failures in call chains affect customers and SLAs | Connections are simple and well controlled
Security | mTLS, identity-based access, and zero trust are required | NetworkPolicy and basic segmentation are sufficient
Releases | Canary releases, traffic splitting, and fast rollback are required | Releases are infrequent, and versions are switched over all at once
Observability | A map of dependencies, latency, and errors between services is required | Prometheus, OpenTelemetry, and logs already provide enough data
Team | A platform team is ready to own the mesh as a critical layer | Maintenance will fall to developers who lack the time and expertise

This table is best read not as a formal checklist, but as a maturity assessment. If the “needed” indicators appear in only one area, a mesh may be too heavyweight a solution. If they recur across security, releases, observability, and team ownership, the problem is no longer an individual service, but the connections between services.

When you can do without a mesh

If most of the characteristics fall into the right-hand column, choosing not to use a mesh is not a sign of technical backwardness. It is a reasonable decision that reduces operational risk.

In an early-stage product with only a few services, it is often more sensible to start with a simpler stack: Gateway API for inbound traffic, NetworkPolicy for network permissions, Prometheus for metrics, OpenTelemetry for tracing, and clear rules in the code.

This approach does not provide all the capabilities of a mesh, but it also avoids adding a new critical layer too early. However, if internal calls start causing incidents, releases require fine-grained traffic management, and access between services has become difficult to explain and verify, the situation changes.

When a mesh becomes justified

A service mesh becomes more useful when problems are no longer local. For example, a failure cannot be explained by the state of a single service, CPU, memory, or application logs. The investigation moves into call chains, latency, authorization errors, non-obvious retries, and differing rules across teams.

In this case, the business is not paying for a “trendy infrastructure layer,” but for control over internal service connections. A mesh helps centralize mTLS, access policies, traffic splitting, retries, timeouts, and observability between services.

This is especially noticeable in critical domains: payments, orders, medical data, B2B operations, user accounts, and internal platforms involving many teams.

But even when the signs clearly point to the benefits of a mesh, it should not be treated as an automatic cure. The next risk is overestimating the installation itself and forgetting that new capabilities require processes and owners.

Where not to overestimate a service mesh

If zero trust exists only as a box-ticking statement, deploying a mesh will not create a workable access model. You still need policy owners, review processes, auditing, and a clear understanding of which services should be allowed to call which others.

The same applies to retries and traffic splitting. Canary releases and fast rollback are useful, but automatic retries can amplify a failure if a downstream service is already degraded. That is why a mesh should be implemented together with operational rules, not as a set of enabled features.
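One operational rule that limits this kind of amplification is a retry budget: retries are allowed only while they stay below a fixed share of regular requests. The class below is a minimal sketch of that idea under simplifying assumptions (no time window, no concurrency); the class name, its methods, and the 20% figure are hypothetical.

```python
class RetryBudget:
    """Allow retries only while they stay within a fixed percentage of original requests."""

    def __init__(self, percent: int):
        self.percent = percent  # e.g. 20 -> at most 20 retries per 100 requests
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Grant a retry if it keeps retries within the budget, otherwise refuse."""
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False


if __name__ == "__main__":
    budget = RetryBudget(percent=20)
    for _ in range(100):
        budget.record_request()
    granted = sum(budget.try_acquire_retry() for _ in range(50))
    print(granted)  # 20
```

When a downstream service is failing for every caller, a budget like this caps the extra load at a known fraction instead of letting it grow with the failure rate.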

The conclusion is simple: a service mesh is needed when service-to-service communication has already become a separate risk area and the team has the capacity to own this layer. If that level of maturity is not yet in place, it is better to first strengthen the basic network model, monitoring, tracing, release process, and access policies.

When the signs that “a mesh is needed” keep recurring in critical systems, the next question is no longer whether an additional layer is required. What matters is understanding how much it will cost to operate: in terms of latency, resources, diagnostics, upgrades, and the load on the platform team.

Cost of ownership: what the business gets with a mesh

Even if a service mesh is truly needed, it cannot be enabled like a standard Kubernetes feature. A mesh enters the critical path of requests: some access control, routing, timeout, retry, and telemetry rules begin to be handled not only by the application, but also by the infrastructure layer between services.

This improves manageability, but changes the risk profile. An error in a policy, certificate, route, or proxy can affect not just a single service, but an entire chain involved in a user operation.
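The size of that blast radius can be estimated from the dependency map. The sketch below assumes a hypothetical call graph loosely based on the services mentioned earlier (`api`, `billing`, `permissions`, `storage`, `notifications`) and finds every service that transitively depends on a broken one.

```python
# Hypothetical call graph: caller -> services it calls directly.
CALLS = {
    "api": ["billing", "notifications"],
    "billing": ["permissions"],
    "permissions": ["storage"],
    "notifications": [],
    "storage": [],
}


def affected_by(broken, calls):
    """Return the broken service plus every service that depends on it, directly or transitively."""
    affected = {broken}
    changed = True
    while changed:  # repeat until no new callers are discovered
        changed = False
        for caller, callees in calls.items():
            if caller not in affected and any(c in affected for c in callees):
                affected.add(caller)
                changed = True
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("storage", CALLS)))
    # ['api', 'billing', 'permissions', 'storage']
```

A bad policy or expired certificate on `storage` in this example reaches the user-facing `api`, while `notifications` stays unaffected.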

New components and new responsibilities

A mesh introduces at least two planes. The first is the management layer, or control plane. It stores and distributes policies, routing settings, security rules, and the mesh configuration.

The second is the traffic processing layer, or data plane. This is the layer that sits in the path of actual requests between services and applies the rules in practice.

The data plane can be implemented in different ways:

  • Sidecar: a separate proxy container alongside the application;
  • Ambient mode in Istio: part of the processing is moved out of each pod into the infrastructure layer;
  • eBPF approach in Cilium: some of the networking logic runs inside the Linux kernel.

For the business, these differences matter not as technical details in their own right, but because they affect latency, CPU and memory consumption, debugging complexity, upgrades, and the requirements placed on the platform team.

Where some of this logic previously lived in application code and libraries, after a mesh is introduced it becomes a separate infrastructure responsibility. This layer must be monitored, updated, troubleshot, and factored into resilience planning.

Where costs and risks arise

The cost of a mesh is not limited to resource consumption. It shows up in several areas:

Area | What changes
Resources | More CPU and memory are required to process traffic
Latency | p95/p99 may increase, especially in long call chains
Diagnostics | The cause of a failure may lie in the code, policy, certificate, proxy, or route
Team | Developers and SREs need to understand retries, timeouts, tracing, and access policies
Updates | The mesh becomes a separate infrastructure product with its own versions and compatibility requirements
Rule errors | You can accidentally block a necessary call or grant unnecessary access

p95 and p99 are tail latencies. They show not the average speed, but the worst experience for a subset of users. For the business, this is often more important than average latency: a service may look “normal” on average, while some customers regularly experience slow requests.
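A small worked example shows why the average hides the tail. The sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; the sample latencies are made up for illustration.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.

    `p` is an integer between 1 and 100.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, to avoid floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in ms: most requests are fast, a few are very slow.
    samples = [20] * 94 + [300] * 4 + [2000] * 2
    print(sum(samples) / len(samples))  # mean looks moderate
    print(percentile(samples, 50))      # 20
    print(percentile(samples, 95))      # 300
    print(percentile(samples, 99))      # 2000
```

Half of the requests finish in 20 ms, yet one request in a hundred takes two seconds; only the p99 figure makes that visible.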

A common risk is aggressive retries. A team enables retries “for resilience,” the dependent service starts to degrade, and the mesh sends it even more retry requests. As a result, load increases, p95/p99 get worse, and recovery takes longer. A mesh centralizes rules, but a centralized mistake also scales faster.
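The scaling effect can be shown with simple arithmetic. Assuming every hop in a chain independently retries a failed call, and ignoring backoff and budgets, the deepest service can receive up to (1 + retries) to the power of the number of hops attempts for a single user request. The function name and the numbers are illustrative.

```python
def worst_case_attempts(retries_per_hop: int, hops: int) -> int:
    """Upper bound on attempts reaching the deepest service for one user request.

    Each hop makes 1 original attempt plus `retries_per_hop` retries,
    and every attempt at one hop can trigger a full set of attempts at the next.
    """
    return (1 + retries_per_hop) ** hops


if __name__ == "__main__":
    # api -> billing -> permissions -> storage is 3 hops.
    print(worst_case_attempts(retries_per_hop=2, hops=3))  # 27
    print(worst_case_attempts(retries_per_hop=0, hops=3))  # 1
```

Two retries per hop sounds conservative, but across three hops it can turn one user request into 27 calls to a service that is already struggling.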

Why this needs to be calculated before choosing a product

A service mesh affects more than just the platform team. Developers also need to understand which network rules apply to their services, why a request was rejected, where to view traces, and how retries or timeouts change application behavior.
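One interaction developers often need to reason about is how a per-try timeout combines with retries. The sketch below assumes a simple model with no backoff between attempts; the function names and values are hypothetical.

```python
def worst_case_wait(per_try_timeout_s: float, retries: int) -> float:
    """Longest time a caller may wait if every attempt runs into its timeout."""
    return per_try_timeout_s * (1 + retries)


def fits_caller_deadline(per_try_timeout_s: float, retries: int, caller_timeout_s: float) -> bool:
    """Check that all attempts can complete before the caller itself gives up."""
    return worst_case_wait(per_try_timeout_s, retries) <= caller_timeout_s


if __name__ == "__main__":
    # 2 s per try with 2 retries can take 6 s, longer than a 5 s caller timeout.
    print(worst_case_wait(2.0, 2))            # 6.0
    print(fits_caller_deadline(2.0, 2, 5.0))  # False
    print(fits_caller_deadline(1.5, 2, 5.0))  # True
```

If the check fails, the last retry does work that nobody is waiting for anymore, which adds load without improving the outcome for the user.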

That is why the total cost of ownership should be assessed before selecting a specific solution. The question is not only whether the product can enable mTLS, canary releases, or observability. The real question is whether the team can operate this layer reliably in a production environment.

With mature ownership, a mesh provides predictability: consistent security policies, controlled releases, a clear dependency map, and faster incident investigations. With weak ownership, it can become yet another source of failures and difficult diagnostics.

Once this cost has been assessed, Istio, Linkerd, and Cilium should be compared not by the number of features, but by their operational implications: implementation complexity, team workload, security, observability, performance, and operational risks.

Istio, Linkerd, and Cilium: different ownership models

After assessing the cost of ownership, comparing Istio, Linkerd, and Cilium by feature count alone is pointless. On paper, they address similar needs, but in production they differ in what matters most: how much complexity they add for the team, how they affect diagnostics, what risks they introduce, and which architecture they fit best.

Before comparing them, it is important to clarify several terms. CNI, or Container Network Interface, is the plugin interface through which Kubernetes connects pods to the network; the CNI plugin is often also what enforces network policies. eBPF is a Linux kernel technology that Cilium uses for parts of its network processing, security, and observability. Envoy is a widely used proxy for traffic processing; it is used in Istio and in some Cilium scenarios. L7 refers to the application layer: HTTP, gRPC, methods, paths, headers, and other request attributes, not just IP addresses and ports.

That is why Cilium is included alongside Istio and Linkerd in this comparison. For the business, this is not a debate about internal implementation, but a choice of ownership model: a separate mesh layer on top of the network, or the evolution of the Kubernetes networking and security layer that has already been selected.

In short, the differences can be summarized as follows:

Criterion | Istio | Linkerd | Cilium
Typical role | A flexible mesh for complex enterprise environments | A simpler entry point into service mesh | Mesh as an extension of Cilium’s networking and security model
Implementation complexity | High | Lower | Depends on the maturity of Cilium usage
Team burden | Requires strong platform expertise | Moderate | Requires expertise in Cilium, eBPF, and network policies
Observability | Rich telemetry, with more entities to consider during diagnostics | Sufficient visibility for typical service-to-service relationships | Strong when paired with Hubble and the Cilium ecosystem
Operational risk | Policy and routing errors can affect large parts of the environment | Lower complexity, but less flexibility | Risk of deep dependency on the selected networking layer

Istio is not “better” than Linkerd just because it is more flexible. Linkerd is not “weaker” just because it is simpler. Cilium does not “replace everything” if the team is not already working within its networking model. The question is which trade-off best fits the business and the platform team.

Istio: maximum flexibility and high operational overhead

Istio makes sense where complex policies, advanced L7 traffic management, canary releases, traffic splitting, mTLS, zero-trust scenarios, multi-cluster deployments, and a strict access model between teams are required.

Its strength is flexibility. But that same flexibility becomes a source of complexity. You need to understand the control plane, data plane, sidecar or ambient mode, routing policies, security, telemetry, and upgrades. A configuration error can affect a large part of the environment, especially if critical user flows pass through the mesh.

For this reason, Istio is best suited not simply to “large companies,” but to organizations with a mature platform team. If the team is ready to own this layer, Istio provides a high degree of control. If not, it can become an overly heavy tool.

Linkerd: a simpler entry point into a service mesh

Linkerd is typically chosen when teams need a simpler path to mTLS, basic observability, and resilience for internal service calls. It is well suited to teams that need a mesh without immediately trying to build a complex enterprise platform around L7 routing and multiple operating modes.

Its advantage is a lower operational burden. It is easier for a team to get started, easier to explain the model to developers, and easier to maintain standard HTTP/gRPC services.

However, simplicity also means less flexibility for nonstandard requirements. If you need complex routing, multi-level policies, unusual multicluster scenarios, or deep L7 control, Linkerd may not be sufficient. In that case, it is better to validate the requirements in advance rather than wait for the limitations to surface after implementation.

Cilium: mesh as an extension of the networking model

Cilium should be viewed differently. It is not just a service mesh, but also a robust networking and security layer for Kubernetes. If a company already uses Cilium as its CNI and builds policies, observability, and network security around Cilium/eBPF, mesh use cases can become a natural extension of that model.

Cilium’s strength lies in the connection between networking, security, and observability. When used with Hubble, it can provide good visibility into traffic and dependencies. Envoy can be used for some L7 scenarios, while the networking logic relies heavily on eBPF.

The main risk is dependence on the chosen networking layer. If Cilium is already the foundation of the platform, this can be an advantage. If the team would have to rebuild the networking model solely for the sake of a mesh, the cost and risk of that step need to be assessed separately.

How to read this comparison

The practical takeaway is straightforward: choose Istio when you need maximum flexibility and have a team ready to operate a complex mesh. Choose Linkerd when you need an easier path to mTLS, observability, and basic resilience. Choose Cilium when the mesh needs to extend the Cilium networking and security model already in place.

The final choice is best validated not through a feature presentation, but by testing it against your own scenarios: latency, overhead, mTLS, canary deployments, retries, HTTP/gRPC traffic, behavior during failures, and ease of diagnostics for developers and the platform team.
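For the latency part of such a test, a comparison can be as simple as measuring the same scenario with and without the mesh and checking the added latency against a budget the team agrees on in advance. The sketch below uses `statistics.quantiles` from the Python standard library; the function names, the sample data, and the budget values are assumptions for illustration.

```python
import statistics


def tail_overhead(baseline_ms, mesh_ms):
    """Return the added latency (mesh minus baseline) at p50, p95, and p99, in ms."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    base = statistics.quantiles(baseline_ms, n=100)
    mesh = statistics.quantiles(mesh_ms, n=100)
    return {f"p{p}": mesh[p - 1] - base[p - 1] for p in (50, 95, 99)}


def within_budget(overhead, budget_ms):
    """Check every percentile listed in the budget against the measured overhead."""
    return all(overhead[name] <= limit for name, limit in budget_ms.items())


if __name__ == "__main__":
    # Hypothetical samples: the mesh adds a flat 5 ms to every request.
    baseline = list(range(1, 101))
    with_mesh = [value + 5 for value in baseline]
    overhead = tail_overhead(baseline, with_mesh)
    print(overhead)
    print(within_budget(overhead, {"p50": 10, "p95": 10, "p99": 10}))
```

Real measurements rarely show a flat overhead, which is exactly why the tail percentiles are worth checking separately from the median.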

The comparison leads to one conclusion: a service mesh should be chosen not as the “most feature-rich product,” but as a new area of responsibility that the business and the team are prepared to support.

Conclusion

A service mesh should be chosen based not on a list of features, but on the actual risks associated with service-to-service calls. If Kubernetes, Gateway API, NetworkPolicy, Prometheus, and OpenTelemetry already cover the current requirements, a mesh may add unnecessary complexity. However, if internal traffic affects payments, orders, access rights, SLAs, auditing, and incident investigations, a mesh becomes a justified tool.

Istio, Linkerd, and Cilium represent different ownership models: Istio is for complex scenarios and a mature platform team, Linkerd provides a simpler entry point into a service mesh, and Cilium is for platforms where networking and security are already built around Cilium/eBPF. The final choice is best validated with a PoC: check p95/p99 latency, resource usage, mTLS, retries, canary scenarios, control plane failure, and ease of troubleshooting.

FAQ

Does every Kubernetes platform need a service mesh?

No. Kubernetes itself does not require a service mesh. A mesh is needed when the complexity of service-to-service calls, along with security, release, and diagnostic requirements, exceeds what simpler tools can handle.

How does a service mesh differ from an API Gateway?

An API Gateway typically operates at the edge of the system: it handles inbound traffic from users, partners, or external clients. A service mesh primarily manages internal calls between services: mTLS, access policies, routing, retries, timeouts, and observability into request chains.

Can a service mesh be implemented solely for mTLS?

Yes, but it is not always the most practical choice. You need to evaluate certificate management, the impact on latency, diagnostics, and the team’s readiness to support a new layer. In some cases, security requirements can be met with simpler mechanisms.

What should be validated in a PoC before deployment?

At a minimum: p50/p95/p99 latency, CPU and memory usage, throughput, error rate, and scenarios involving mTLS, retries, HTTP/gRPC, canary releases, and failures of dependent services. It is also important to assess how easily the platform team and developers can investigate incidents using the new layer.

When should you choose Istio, Linkerd, or Cilium?

Istio is best suited for complex enterprise use cases, flexible L7 traffic management, and organizations with a mature platform team. Linkerd is a better fit for an easier entry point into mTLS, observability, and basic resilience. Cilium is appropriate if the company already uses Cilium as the networking and security layer for Kubernetes.

Can a service mesh degrade performance?

Yes. A mesh adds traffic processing and can increase latency, CPU usage, and memory consumption. The outcome depends on the architecture, operating mode, protocols, traffic volume, and configuration, so there is no universal answer without testing.

