Cloud DNS private zones, split-horizon DNS, TTL, and DNS failover for cloud infrastructure

Last updated: Jun 16, 2026. Reading time: 15 minutes.

In cloud infrastructure, DNS manages not only names but also, indirectly, the actual path traffic takes. The same app.example.com name can direct an external user to a public load balancer, while a service inside a VPC or VNet receives a private endpoint. DNS therefore needs to be designed not around the record itself, but around the question: who is making the query, and what address should be returned?

The basic model is as follows:

Public DNS serves the external-facing perimeter: users, partners, public APIs, CDN, and external load balancers.
Private DNS zones serve internal networks: VPC, VNet, project network, hybrid connections, private databases, and internal services.
Split-horizon DNS allows a single name to return different responses depending on the source of the query: externally, a public endpoint; internally, a private one.
TTL defines how long a DNS response can remain in the cache. It is a trade-off between the speed of changes and cache stability.
DNS failover routes new DNS queries to a backup endpoint, but it does not work like an instant switch.

The main risk is that a DNS response depends on more than just the record in the zone. The result is affected by the query source, the private zone’s association with the network, the recursive resolver, the OS cache, the application cache, the TTL, and the state of the service. As a result, checking DNS from a developer’s laptop does not prove that a pod in Kubernetes, a VM in a private subnet, or a service from a hybrid network will receive the same response.

TTL also should not be treated as a guaranteed failover time. If a record has been in use for a long time with a TTL of 3600, lowering the TTL to 60 seconds immediately before an incident will not clear caches that already contain the previous response. A low TTL helps well-behaved clients replace stale responses more quickly, but it increases the number of DNS queries and does not control connections that are already open.

DNS failover is useful as part of a resilience strategy, but it is not a substitute for a DR plan. It can change the response for new name resolutions, but it will not restore data, perform replication, guarantee RPO/RTO, or resolve application-level issues. For critical systems, DNS should be one layer in the overall recovery architecture, alongside load balancers, health checks, data replication, failover procedures, and regular testing.

Why DNS design should start with the query source

In a simple infrastructure, DNS is often treated as a directory: a name maps to an IP address. In the cloud, this model quickly becomes too limited. A company may have a public API, internal services in a VPC or VNet, private databases, multiple regions, and hybrid connectivity to an office or data center network.

In this type of environment, DNS starts to affect the traffic path. The same FQDN may resolve differently from the internet, from a private subnet, from a Kubernetes cluster, or from a hybrid network. An error in a zone, a private DNS association, or a TTL value can send a service to the wrong endpoint, increase latency, expose an unnecessary public route, or break failover.

That is why design should begin not with choosing an A, CNAME, or Alias record, but with the basic name resolution model: who makes the DNS query, which resolver it passes through, which zone should respond, and which endpoint is considered correct.

Public DNS and private DNS: who is querying and which endpoint receives the request

The key question in cloud DNS is not “what is the zone called,” but “where did the query come from.” A DNS name by itself does not determine the route. The same FQDN can return different answers depending on which resolver handles the query: public DNS from the internet or a private resolver inside a VPC, VNet, or hybrid network.

Public DNS: an external client receives a public endpoint

For an internet user, www.example.com should typically resolve to a public entry point: a CDN, a CNAME to an external service, a public IP address, or a public load balancer. This response is returned by a public DNS zone accessible from the internet.

Internet user → recursive resolver → public DNS zone → public endpoint / CDN / public load balancer

This is the external perimeter. Its purpose is to direct users, partners, and external systems to the public application or API correctly.

Private DNS: an internal service gets a private endpoint

Another example is db.internal.example.com. External users do not need this name. It is queried by a VM, a container, a Kubernetes workload, or an internal service in a VPC/VNet.

The response is usually the database’s private IP address, a private load balancer, an internal API, or an address used for service discovery. This response is returned by a private DNS zone linked to specific networks.

Service in VPC → cloud/private resolver → private DNS zone → private IP / private load balancer

This leads to an important practical implication: checking DNS from a developer’s laptop does not prove that a service inside a VPC will receive the same response. Queries go through different resolvers and may end up in different zones. DNS should therefore be checked from the environment where the client actually runs: from an external user’s browser, from the application container, from a VM in the appropriate subnet, or from a connected hybrid network.
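One way to run this check from the client's own environment is a small script executed inside the container, VM, or runner itself. The sketch below is a minimal illustration using only the Python standard library. The function names are placeholders, and `is_private` follows Python's own definition of non-global address ranges, which may not match how a particular network defines "internal".

```python
import ipaddress
import socket


def classify(address: str) -> str:
    """Label an IP address as 'private' or 'public' (Python's definition)."""
    ip = ipaddress.ip_address(address)
    return "private" if ip.is_private else "public"


def resolve_and_classify(name: str) -> dict[str, str]:
    """Resolve a name through the local resolver path and label each address."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return {addr: classify(addr) for addr in addresses}
```

Calling `resolve_and_classify("app.example.com")` from a laptop and again from a pod or a VM in the target subnet makes the difference between the two resolution paths visible: the same name may be labeled public in one place and private in the other.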

How to avoid mixing up public, private, and split-horizon DNS

To avoid confusing DNS resolution models, it is useful to compare them by the source of the request and the type of endpoint returned:

Approach	Who makes the request	Which endpoint is returned	Main risk
Public DNS	A user or external service on the internet	Public IP, CDN, CNAME, public load balancer	Routing external traffic to the wrong destination or exposing unnecessary public names
Private DNS	A service inside a VPC/VNet, project network, or hybrid network	Private IP, private load balancer, internal API, database	Testing from the wrong network and getting a misleading result
Split-horizon DNS	External and internal clients for the same FQDN	Different responses in the public and private environments	Receiving an unexpected response because the wrong zone or request source is used

Public and private DNS solve different problems. The former serves the external environment, while the latter serves internal networks. Private DNS, however, is not a complete security control for a service: it hides records from public DNS, but it does not replace IAM, firewalls/security groups, network policies, or application-level access control.

This separation provides architectural clarity: public names point to public entry points, while private names point to internal resources. In real cloud architectures, however, external and internal clients often need to use the same FQDN. This is where split-horizon DNS comes in: a single name exists in two views and returns different responses depending on where the request comes from.

Split-horizon DNS: one name, different responses from different networks

How split-horizon works

Split-horizon DNS, or split-view DNS, means that a single fully qualified domain name exists in two views: public and private. The response depends not on the name itself, but on which resolver and which zone the query passes through.

The flow looks like this:

Internet user → public resolver → public DNS zone → app.example.com → public endpoint

Service in VPC/VNet → private resolver → private DNS zone → app.example.com → private endpoint

The same FQDN does not imply the same DNS response. That is why you need to test not just the name itself, but the name from a specific network context: from the internet, from a VPC/VNet, from a Kubernetes cluster, from a VM in the target subnet, or from a hybrid network.

A practical example is a SaaS application with the name api.example.com. Externally, this name points to a public load balancer or CDN. Inside the cloud, the same name returns the address of a private load balancer or internal API. This is convenient for applications: there is no need to maintain separate public-api.example.com and internal-api.example.com names, and less conditional logic is required in the configuration.

Where split-horizon breaks down

The main risk comes from overlapping zones: the same name exists in both a public and a private zone, so a missing association can silently change the answer instead of causing an error. If a private zone is not associated with the required VPC or VNet, a service inside the cloud may not receive the private answer and may get the public endpoint instead. In a hybrid setup, a similar error occurs when an office or data center resolver sends the query to public DNS rather than to the cloud private resolver.

As a result, a Kubernetes cluster, a CI/CD runner, a virtual machine, and a developer’s laptop may see different answers for the same name. Each answer will be “correct” within its own resolution path.
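The effect of a missing association can be illustrated with a toy model. This is a simplified sketch, not how any provider implements private zones: real associations are made per VPC or VNet rather than per CIDR, and the zone contents, addresses, and network ranges below are hypothetical.

```python
import ipaddress

# Hypothetical data: one FQDN present in two views.
PUBLIC_ZONE = {"app.example.com": "198.51.100.10"}
PRIVATE_ZONE = {"app.example.com": "10.0.1.10"}

# Networks the private zone is associated with (assumed CIDRs).
ASSOCIATED_NETWORKS = [ipaddress.ip_network("10.0.0.0/16")]


def resolve(name: str, source_ip: str) -> str:
    """Return the answer a client at source_ip would see in this toy model."""
    source = ipaddress.ip_address(source_ip)
    if any(source in net for net in ASSOCIATED_NETWORKS):
        return PRIVATE_ZONE[name]
    # No association: the query is answered from the public zone.
    return PUBLIC_ZONE[name]


print(resolve("app.example.com", "10.0.5.7"))  # associated network: 10.0.1.10
print(resolve("app.example.com", "10.1.5.7"))  # no association: 198.51.100.10
```

The second client is internal in every practical sense, yet it receives the public answer without any error being raised, which is why the problem tends to go unnoticed until latency, firewall rules, or certificates expose it.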

Before introducing split-horizon, it is useful to check:

Which network the DNS query is sent from;
Which zone should respond: public or private;
Which endpoint is expected in the response;
Whether the private zone is associated with the required VPCs/VNets;
Whether there is a conflicting zone in the hybrid DNS setup;
Whether an internal service is accidentally using a public endpoint.

This check reduces the risk of traffic being silently routed along the wrong path. Otherwise, an internal service may use a public entry point, with different latency, unnecessary outbound traffic, different firewall/security group rules, and potential certificate and availability issues.

Split-horizon solves the problem of using consistent names, but it requires disciplined zone management and validation from different networks. Even when a client receives the correct answer, the next question remains: how long that answer will live in the cache and how quickly changes will reach clients. This is where TTL becomes important.

TTL: a trade-off between cutover speed and cache stability

TTL is the lifetime of a DNS response in cache. It does not mean that a record will “take effect at the provider in 60 seconds,” and it does not guarantee that all clients will see the new IP address at the same time. The authoritative DNS server may already be returning the new value, while a recursive resolver, the OS cache, or an application cache may still be holding the old response.

A simple example: api.example.com is being moved from one load balancer to another. If the record had been served for several hours with a TTL of 3600, lowering the TTL to 60 seconds immediately before the cutover will not force existing caches to forget the old address. They have already received a response with a one-hour lifetime and, if they behave correctly, will continue to use it until the TTL expires.
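The behavior in this example can be reproduced with a toy cache that honors the TTL delivered with each answer. It is a minimal sketch on a simulated clock; the class, the addresses, and the timings are hypothetical and stand in for a well-behaved recursive resolver.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    address: str
    expires_at: float  # seconds on a simulated clock


class ToyResolverCache:
    """A cache that honors the TTL it received with each answer."""

    def __init__(self) -> None:
        self._entries: dict[str, CachedAnswer] = {}

    def lookup(self, name: str, now: float, authoritative: dict) -> str:
        entry = self._entries.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.address  # still valid: the zone is not consulted
        address, ttl = authoritative[name]
        self._entries[name] = CachedAnswer(address, now + ttl)
        return address


zone = {"api.example.com": ("192.0.2.10", 3600)}  # old record, one-hour TTL
cache = ToyResolverCache()
cache.lookup("api.example.com", now=0, authoritative=zone)

# At t=600 the record is changed and its TTL lowered to 60 seconds.
zone["api.example.com"] = ("192.0.2.20", 60)

print(cache.lookup("api.example.com", now=660, authoritative=zone))   # 192.0.2.10
print(cache.lookup("api.example.com", now=3600, authoritative=zone))  # 192.0.2.20
```

The cache keeps serving the old address until the original one-hour lifetime runs out, because the new 60-second TTL only applies to answers fetched after the change.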

How to choose a TTL

A TTL should not be chosen on the assumption that “lower is better.” A low TTL helps compliant clients evict stale responses faster, but it increases the number of DNS queries and reliance on resolvers. A high TTL reduces noise and load, but makes changes take effect more slowly.

TTL	Where it is appropriate	Benefit	Limitation
30–60 seconds	Planned cutovers, dynamic records	Stale responses expire sooner	More DNS queries, with no guarantee of an immediate transition
300 seconds	Many application services	A balance between manageability and load	Changes still have a delay
3600+ seconds	Stable names that rarely change	Fewer queries and a more stable cache	Changes and cutovers are slow
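The query-load side of the table can be approximated with simple arithmetic. The sketch below assumes a single cache that honors the TTL and a name that is requested continuously, so it gives an upper bound per cache rather than a measured rate; the function name is illustrative.

```python
def max_queries_per_hour(ttl_seconds: int) -> float:
    """Upper bound on refreshes per hour from one cache that honors the TTL."""
    return 3600 / ttl_seconds


for ttl in (30, 60, 300, 3600):
    rate = max_queries_per_hour(ttl)
    print(f"TTL {ttl:>4}s -> up to {rate:g} queries/hour per cache")
```

Going from a 3600-second TTL to a 30-second TTL raises this bound from 1 to 120 refreshes per hour for every cache that keeps the name warm.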

For critical changes, the TTL should be lowered in advance. A proper plan is to first reduce the TTL, wait for the old higher TTL to expire, then change the record and verify name resolution from the required VPCs/VNets, hybrid networks, and external locations. After the change has stabilized, you can restore a higher TTL for normal operation.
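The waiting step in this plan is easy to calculate. The helper below is an illustrative sketch: it assumes every cache honors the TTL, and the timestamp and function name are hypothetical.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Time by which TTL-honoring caches have dropped answers carrying the old TTL."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


lowered = datetime(2026, 6, 16, 9, 0)  # hypothetical: TTL reduced from 3600 to 60
print(earliest_safe_cutover(lowered, old_ttl_seconds=3600))  # 2026-06-16 10:00:00
```

In practice it is reasonable to add a margin on top of this value, since some resolvers and applications hold answers longer than the TTL allows.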

A high TTL is useful when a record is stable. It reduces DNS load, reduces external dependencies for repeat requests, and makes client behavior less noisy. A short TTL is therefore not a universal improvement, but a tool for records that genuinely need to be managed quickly.

Why TTL is not the same as failover time

TTL is a trade-off setting, not a guarantee of recovery time. It only limits how long a response is stored by the components in the chain that honor the TTL.

This directly affects a future DNS failover: the higher the TTL before an outage, the longer some clients may continue using the old endpoint. Lowering the TTL at the moment of failure is too late—the old responses have already been distributed across caches.

Even a low TTL does not eliminate all delays. A client may have an application cache, an OS cache, a recursive resolver that enforces its own minimum TTL and keeps answers longer than the record specifies, or an already open connection to the old endpoint. DNS does not revoke such connections or force the application to recreate its connection pool immediately.

TTL therefore explains why DNS changes have inertia. DNS failover should be viewed not as an instant switch, but as a mechanism that changes responses for new DNS queries while taking health checks, caches, and client behavior into account.

DNS failover: how new DNS lookups switch over and why it isn’t instant

How DNS failover works

A typical scenario: api.example.com points to the primary load balancer in us-east, while the standby endpoint is in eu-west. The health check detects a failure in the primary region, the routing policy takes effect, and the authoritative DNS server starts returning the eu-west address.

In simplified terms, it looks like this:

Health check detects failure → authoritative DNS returns secondary endpoint → new DNS queries go to secondary endpoint

However, the switchover will be seen first by clients that make a new DNS query after the response changes. Others may still use the old cache or keep an open connection to the primary endpoint.

The actual failover time is the sum of several delays:

Failover time ≈ failure detection + DNS response update + cache lifetime + client behavior

This is not an SLA, but a latency model. It shows why a low TTL helps but does not make failover instantaneous.
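The latency model can be turned into a rough estimate. The sketch below adds the components for a client that honors the TTL. The field names and the sample numbers are assumptions chosen for illustration, and failure detection is modeled as the check interval multiplied by the number of failed checks required, which is a common but not universal health-check design.

```python
from dataclasses import dataclass


@dataclass
class FailoverEstimate:
    check_interval_s: float   # seconds between health checks
    failure_threshold: int    # consecutive failed checks before failover
    dns_update_s: float       # time for the authoritative answer to change
    ttl_s: float              # TTL of the record
    client_delay_s: float     # app/OS cache, reconnect, and retry delay

    def worst_case_seconds(self) -> float:
        """Sum of the delays for a client that honors the TTL."""
        detection = self.check_interval_s * self.failure_threshold
        return detection + self.dns_update_s + self.ttl_s + self.client_delay_s


estimate = FailoverEstimate(
    check_interval_s=30, failure_threshold=3,
    dns_update_s=10, ttl_s=60, client_delay_s=30,
)
print(estimate.worst_case_seconds())  # 190
```

With these sample values the 60-second TTL accounts for less than a third of the total, which shows why lowering the TTL alone does not bring failover close to instant.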

What DNS does not cover

DNS does not manage already established TCP connections, keep-alive, connection pools, or retry logic in the application. If a client keeps a connection open to the primary load balancer, it may continue trying to use it until timeouts occur, retries run, or the connection is re-established.
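On the application side, one common mitigation is to resolve the name again whenever a connection has to be re-established, instead of reusing an address obtained at startup. The following is a minimal sketch of that idea with the standard library. The function names, retry count, and timeouts are illustrative assumptions, and the `resolve` and `connect` parameters exist only to make the logic easy to test.

```python
import socket
import time


def resolve_all(host: str, port: int) -> list[str]:
    """Ask the resolver again instead of reusing an address stored at startup."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def connect_fresh(host, port, attempts=3, backoff_s=1.0, timeout_s=5.0,
                  resolve=resolve_all, connect=socket.create_connection):
    """Open a new connection, re-resolving the name before every attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            addresses = resolve(host, port)
        except OSError as exc:  # includes socket.gaierror
            addresses, last_error = [], exc
        for address in addresses:
            try:
                return connect((address, port), timeout_s)
            except OSError as exc:
                last_error = exc  # try the next address, then re-resolve
        if attempt < attempts - 1:
            time.sleep(backoff_s)
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error
```

This only helps once the old connection has actually failed or been closed. Idle timeouts and connection pool recycling still determine when that happens.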

Private endpoints have a separate limitation: external health checks often cannot reach a private IP inside a VPC or VNet. For this reason, the health of a private service must either be checked from a network environment that can reach it, or be reported to the DNS health check through an internal health signal, such as a metric or an agent running in the correct network.
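A check that runs inside the network can be as simple as a TCP probe. The function below is a minimal sketch of such a probe; its name and default timeout are assumptions. How the result is published (a metric, an alarm, or another signal the DNS health check can read) depends on the provider and is deliberately left out.

```python
import socket


def tcp_health(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

A successful TCP handshake shows only that the port is reachable from that network segment. An application-level check, such as a request to a health endpoint, is a stronger signal where one is available.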

Because of this, DNS failover should be treated as a gradual switchover for new name resolutions, not as a single event for all clients. It is useful as a routing layer during a failure, but it does not replace a load balancer, application-level retries, or a complete DR plan.

Conclusion

DNS in cloud infrastructure should be designed based on the source of the request and the expected endpoint. Public DNS serves the external perimeter, private zones serve internal networks, and split-horizon DNS allows a single FQDN to return different responses for different networks, but it requires precise zone attachment and validation from real environments.

TTL determines the inertia of DNS responses in the cache, while DNS failover changes the route only for new name resolutions. Therefore, for critical systems, DNS must be part of the overall fault-tolerance design, alongside load balancers, health checks, data replication, RTO/RPO, recovery procedures, and regular testing.

FAQ

How does a private DNS zone differ from a public DNS zone?

A public DNS zone responds to queries from the internet and typically returns a public endpoint: a CDN, a public IP address, a CNAME to an external service, or an external load balancer. A private DNS zone is accessible only from associated VPCs, VNets, or hybrid networks and returns internal addresses: a private IP address, a private load balancer, an internal API, or a database endpoint.

When should you use split-horizon DNS?

Split-horizon DNS is appropriate when a single FQDN must work for both external clients and internal services, but return different endpoints. For example, from outside, api.example.com points to a public load balancer, while inside the VPC it points to a private one. This approach should be used only when you have control over the zones, network associations, and DNS checks across all critical environments.

Why doesn’t a low TTL guarantee instant failover?

TTL limits how long compliant caches retain a DNS response, but it does not control every component in the chain. Recursive resolvers, the OS cache, the application cache, and already open connections may retain the old state longer than expected. As a result, a low TTL speeds up the removal of the old response, but it does not make DNS failover instantaneous.

What TTL should you choose for a cloud service?

There is no one-size-fits-all value. For records that change, a short TTL—such as 30–300 seconds—is often used, but you need to account for the increase in DNS queries and the dependence on resolvers. For stable names, you can set a higher TTL if slower switching does not violate availability requirements.

Does DNS failover replace a load balancer?

No. DNS failover changes the response to new DNS queries. A load balancer handles traffic and backend pools after the client has already received an address and initiated a connection. In a fault-tolerant architecture, DNS failover and a load balancer can complement each other, but they do not perform the same function.

What is especially important for private endpoints during failover?

The health check must be able to reach the private service from the appropriate network segment. External checks often cannot access a private IP inside a VPC or VNet, so an internal health signal, a metric, an agent in the correct network, or integration with cloud monitoring is needed.
