
Designing High-Load System Architectures: Key Considerations and Best Practices

Today we are launching a series of articles on designing high-load information systems, and information systems in general, since almost any system has a chance of becoming highly loaded. You cannot start the design journey without knowing the fundamentals. That's why – disclaimer – this is not a how-to guide you can apply directly to your application infrastructure; rather, it is an introduction that covers the basics.

This article will be useful for developers and architects, especially beginner HLS architects, because it covers the most fundamental concepts behind high-load systems (HLS) and real-time systems (RTS).

All parameters described below have a direct impact on software architecture design decisions.

The role of data and metadata feeding the lakes

All modern technologies are based on data, metadata, and events. They can be briefly described using the following definitions:

  • Data is the actual information received from internal and external sources used for processing, analysis, and storage.

Possible forms of data: any information, from text messages to images and videos. 

  • Metadata is data that describes other data, used to identify, sort, and simplify handling large amounts of information.

Possible forms of metadata: what time a text file was created, who created it and how, its size, type, how it is used, what changes were made, and so on.

  • Events are notifications of actions that take place in the system. They may be triggered by the user, an application, or other system components.

What they're used for: performance monitoring, error detection, and security and access control.

Data, metadata, and events each serve distinct roles, yet they are deeply interconnected. For example, data can generate events, while events can impact data. Metadata, in turn, helps track changes in both. Therefore, these three elements need to be manageable and accessible inside the system.
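This interplay can be made concrete with a small sketch (the `store` helper, field names, and event shape are all hypothetical): saving a piece of data derives its metadata and emits an event describing the change.

```python
import hashlib
import time

def store(blob: bytes, db: dict, events: list) -> dict:
    """Save data, derive its metadata, and emit an event about the change."""
    key = hashlib.sha256(blob).hexdigest()           # identifier derived from the data
    metadata = {"size": len(blob), "created_at": time.time(), "type": "bytes"}
    db[key] = (blob, metadata)                       # data and its metadata travel together
    events.append({"action": "stored", "key": key})  # event generated by the data change
    return metadata

db, events = {}, []
meta = store(b"hello, world", db, events)
```

Here the data change generates an event, and the metadata (size, timestamp, type) makes the stored blob identifiable and sortable, mirroring the relationships described above.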

Modern technology simply cannot function without them — any technology. High-load systems and real-time systems are also not feasible without them. Moreover, processing vast amounts of data and events from both internal system components and external sources is one of the primary drivers of load in high-load (HLS) and real-time systems (RTS).

In real-time systems (RTS), events are typically categorized into three types to simplify processing:

  • Synchronous events are predictable events that occur at a known frequency. Examples include messages related to data exchange and synchronization. Generally, the synchronous type refers to data transmission with acknowledgment of receipt.
  • Asynchronous events are completely unpredictable events, such as control commands, alarms, and alerts. In a broader sense, the asynchronous type is understood as data transfer without acknowledgment of receipt.
  • Isochronous events (a subtype of asynchronous) are regular events occurring at a specific time interval. Examples include diagnostic messages about the status of system components.
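As a rough sketch, the three event types might be told apart by two hypothetical attributes: a fixed interval marks isochronous traffic, and an acknowledgment requirement marks synchronous exchange; everything else is treated as asynchronous.

```python
from enum import Enum, auto

class EventType(Enum):
    SYNCHRONOUS = auto()   # predictable, known frequency; delivery is acknowledged
    ASYNCHRONOUS = auto()  # unpredictable (commands, alarms, alerts); no acknowledgment
    ISOCHRONOUS = auto()   # fires at a fixed interval, e.g. periodic diagnostics

def classify(event: dict) -> EventType:
    # Hypothetical classification rules for illustration only
    if event.get("interval_ms") is not None:
        return EventType.ISOCHRONOUS
    if event.get("requires_ack", False):
        return EventType.SYNCHRONOUS
    return EventType.ASYNCHRONOUS
```

A real RTS would classify events at the protocol level rather than by inspecting dictionaries, but the three-way split is the same.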

Based on their characteristics and functionality, data is categorized in relation to:

  • High-load systems (HLS) — Some data is ancillary, meaning it is crucial for the system's functioning but remains invisible to users. This could be transaction data, event logs, sensor and IoT readings, user data, and more.

Often, processing this type of data in HLS mode is sufficient. If text variables can be predefined and exchanged using their identifiers, real-time (RT) mode may not be necessary.

  • Real-time systems (RTS) — Data classification here primarily depends on whether specific data needs to be processed in real-time (RT) mode, i.e., with reference to time. Additionally, data that is frequently accessed or updated is often considered real-time. This can be transaction data, event logs, data from sensors and IoT devices, monitoring and control data, geolocation, video surveillance, etc.
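The idea of predefining text variables and exchanging only their identifiers, mentioned in the HLS bullet above, can be sketched as follows; the message table and the 2-byte wire encoding are illustrative assumptions:

```python
# Hypothetical message dictionary agreed on at design time: nodes exchange
# short integer ids instead of the full predefined strings
MESSAGES = {1: "STARTUP COMPLETE", 2: "QUEUE DEPTH WARNING", 3: "SHUTDOWN REQUESTED"}
IDS = {text: mid for mid, text in MESSAGES.items()}

def encode(text: str) -> bytes:
    return IDS[text].to_bytes(2, "big")   # 2 bytes on the wire instead of the string

def decode(payload: bytes) -> str:
    return MESSAGES[int.from_bytes(payload, "big")]

wire = encode("QUEUE DEPTH WARNING")
```

Since both sides already know the full text, only a tiny identifier crosses the network, which is often enough to avoid true real-time requirements.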

One important detail in data processing is balancing the amount of RT data to be processed and the metadata. Without proper optimization, excessive workloads can arise, potentially reducing system efficiency. Even if the system's overall performance stays stable, a bottleneck may occur when a single component becomes overloaded, creating a weak point in the infrastructure.

Another key consideration is that the system does not differentiate data by importance by default. To classify and manage data effectively, these requirements must be established during the design phase, such as categorizing data into critical, non-essential, and auxiliary categories.
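Categorizing data by importance at design time might look like this minimal sketch, where a priority queue ensures critical records are processed first (the category names and the `heapq` approach are illustrative):

```python
import heapq
from enum import IntEnum

class Criticality(IntEnum):
    CRITICAL = 0       # lowest value pops first from the heap
    NON_ESSENTIAL = 1
    AUXILIARY = 2

queue: list = []
counter = 0            # tiebreaker preserves insertion order within a class

def submit(kind: str, payload: str, level: Criticality) -> None:
    global counter
    heapq.heappush(queue, (level, counter, kind, payload))
    counter += 1

submit("event_log", "user login", Criticality.AUXILIARY)
submit("transaction", "debit $10", Criticality.CRITICAL)
first = heapq.heappop(queue)   # the critical record jumps the queue
```

Because the importance levels were fixed during design, the system can differentiate data that would otherwise all look the same.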

Ways of exchanging data

Both HLS and RTS rely on many nodes to execute diverse tasks. At the physical level, these can be various devices, including workstations, servers, IoT devices, and switches. Each differs in load distribution, availability, and operational continuity, factors that must be considered when designing the hardware infrastructure. It is, therefore, common to divide this physical variety of nodes into three categories:

Distributed Control Systems (DCS) provide several key functionalities, including:

  • efficient management of distributed resources (processing power, memory, network bandwidth, etc.),
  • coordinated operation of various system components (data and message flow management, access to shared resources, conflict resolution, etc.),
  • distributed data storage and management (replication of data across different nodes for fault tolerance, scalability, and fast access),
  • data processing and analytics (may include tools for distributed computing, data flow processing, machine learning, etc.),
  • load balancing between the nodes of the system,
  • fault tolerance and reliability.

Distributed Service Systems (DSS) provide several key functionalities, including:

  • request management (including receiving requests, routing them to the appropriate system components, and controlling the processing of requests),
  • the possibility of horizontal and vertical scaling of the system to handle a large number of requests,
  • balancing the load between the various nodes and system components to ensure an even load and efficient use of resources,
  • reliability and integrity of transaction execution (including competitive access control, error handling, and fault recovery),
  • resiliency (including data backups, replication of system components, disaster recovery mechanisms, and making services available in the event of node or component failure),
  • session and state management to maintain and restore user state when switching between different system components,
  • monitoring and performance management to identify bottlenecks, optimize system performance, manage resources, and ensure a responsive experience for users.

Distributed Computing Systems (DComS) provide several key functionalities, including:

  • distributed data storage and processing between different nodes in the system,
  • horizontal and vertical scaling of the system (allows the system to adapt to changing workloads and ensures high performance),
  • task and load sharing between the nodes of the system (includes mechanisms for request routing, dynamic redistribution of tasks, and automatic adaptation to changing load),
  • management of tasks and schedule in the system (includes planning and distribution of tasks between nodes, task control, and optimization of computing resources),
  • system resilience (may include data replication, duplication of nodes, backup, and disaster recovery mechanisms),
  • synchronization between the system nodes.

Of course, the functionality of the systems listed above may vary depending on the specific requirements and characteristics of the system. For example, Distributed Computing Systems and Distributed Service Systems manage things at some basic level — whether handling their own computations or serving users — so they can also be classified as Distributed Control Systems. To simplify our discussion, we’ll generalize these concepts under a single term and continue exploring Distributed Control Systems and the efficiency of interactions between their components.

Using an event-driven model, DCS nodes can be interconnected and exchange messages via TCP and UDP protocols. However, since their underlying technologies function differently, their practical applications also vary.

TCP

TCP is primarily known for its guaranteed message delivery. In the taxonomy above, it is a synchronous unicast protocol (i.e. one-to-one transmission with acknowledgment of receipt), and each TCP connection is handled separately. However, this additional overhead typically makes TCP slower than other protocols.

TCP is generally not ideal for Real-Time Systems (RTS) due to several key factors:

  • Connection setup delays – In RTS, minimizing connection establishment time is critical. With numerous connections, cumulative setup time increases, reducing overall system efficiency.
  • Message obsolescence – New messages may become outdated while in transit, failing to reach their destination in time. In system management, even slightly outdated data can be detrimental.
  • Limited support for asynchronous messaging – TCP is not inherently designed for high-performance asynchronous messaging, making it difficult to align with the real-time asynchronous world.
  • Broadcast inefficiencies – When using TCP in broadcast mode, delays may arise due to circulating acknowledgment packets, leading to excessive system load.
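A minimal localhost sketch makes the setup cost visible: `connect()` completes a full three-way handshake before a single payload byte moves, which is exactly the per-connection delay the first bullet refers to.

```python
import socket
import threading

def tcp_echo_server(sock: socket.socket) -> None:
    conn, _ = sock.accept()          # blocks until the three-way handshake completes
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)           # every segment is acknowledged by the peer's stack

# Server on an ephemeral localhost port
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=tcp_echo_server, args=(srv,), daemon=True).start()

# Client: connect() alone costs a round trip before any payload is sent
cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"ping")
reply = cli.recv(1024)
cli.close()
```

Multiply that handshake by thousands of short-lived connections and the cumulative setup time becomes the efficiency problem described above.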

UDP

UDP is an asynchronous protocol, which allows it to work more efficiently in multicast (i.e. one source to multiple recipients) or broadcast (i.e. one source to all possible recipients) modes. Typical UDP use cases are media streams, where speed matters more than occasional loss (as in classic SIP-based telephony, where the media itself travels over RTP), and cases where continuous data flow is paramount (for example, syslog message delivery).

UDP is also a better fit for asynchronous messaging. With it, you can track diagnostic messages, some of which carry diagnostic metrics. From these, you can derive message quality metrics such as the health of individual components and processes, delivery-confirmation metrics based on specified requirements, and so on.

However, all of UDP’s advantages do not make it the only right solution. The choice has to be conscious. For example, if message delivery and integrity are important, remember that TCP implements these in the protocol itself, while with UDP you would have to implement them yourself.
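A minimal sketch of UDP-style diagnostics (the heartbeat fields are hypothetical): the sender fires a datagram and moves on, with no handshake and no acknowledgment.

```python
import json
import socket
import time

# Receiver bound to an ephemeral localhost port
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
addr = rx.getsockname()

# Sender emits a diagnostic heartbeat and moves on: no connection, no ack
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
heartbeat = {"component": "node-a", "healthy": True, "ts": time.time()}
tx.sendto(json.dumps(heartbeat).encode(), addr)

msg, _ = rx.recvfrom(2048)   # on a real network this datagram may never arrive
status = json.loads(msg)
tx.close()
rx.close()
```

The trade-off in the paragraph above shows up in the comment: delivery is not guaranteed, so any confirmation semantics would have to be built on top.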

Transactions

High-load systems and real-time systems have their own tolerance limits. For example, transactions can be used in HLS, but not in RTS. The reason lies in the very concept of a transaction.

The formal definition is: "an ordered set of operations that transfers a database from one consistent state to another". To put it simply - it is a group of operations combined into one logical unit, which is considered complete only if all of its operations have been executed. Thus, you get "all or nothing" :) A good real-world analogy is a bank transfer, which is only considered complete once the money has left one account and reached the other.

The basis of this approach is a set of ACID requirements that ensure data integrity:

  • Atomicity, which does not allow intermediate states - a transaction is either fully executed or not.
  • Consistency, which maintains data consistency by committing only fully completed transactions (EOT - end of transaction).
  • Isolation, which does not allow concurrently executed transactions to influence each other.
  • Durability, which ensures that the changes made by a successfully completed transaction are preserved even if failures or other problems occur at lower levels afterward.

As mentioned above, these conditions are more relevant to HLS, since transaction synchronization is feasible in the vast majority of cases. The transactions themselves are processed in queues, for which it is efficient to use all available computing hardware resources.
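The all-or-nothing behavior can be sketched with Python's built-in `sqlite3` module, using the bank-transfer analogy from above; the schema and amounts are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(db: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    try:
        with db:  # opens a transaction; commits on success, rolls back on error
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                       (amount, src))
            (balance,) = db.execute("SELECT balance FROM accounts WHERE name = ?",
                                    (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # aborts the whole unit
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                       (amount, dst))
    except ValueError:
        pass  # the rollback already restored the consistent state

transfer(conn, "alice", "bob", 150)  # fails: the debit is rolled back too
transfer(conn, "alice", "bob", 40)   # succeeds atomically
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no trace: the money never leaves one account without reaching the other, which is exactly the ACID guarantee described above.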

As for real-time systems, synchronous transactions are unacceptable because of the logic of the system itself: starting from three nodes, synchronicity breaks down and cascading or co-dependent/chained transactions occur:

In this case, node A waits for the transaction to pass through all nodes in one direction and return with the result along the same path; only then is the transaction considered complete. As a result, delays and blockages arise immediately. And this is only the beginning.

If the software architecture is designed correctly, the latency of waiting for complex cascading transactions to execute can be compensated for. One solution is to increase the number of transactions, but keep in mind that processing of new transactions from the queue may then start before an active transaction finishes. Because of this, cascading transactions partially take on the properties of asynchronous processing, and the initiators of the transactions are affected by the blocking of those transactions' execution.

Figure: Chain transaction

Such a system can extend endlessly in length.

Figure: Chain transaction (asynchronous)

This results in asynchronous cascading transaction processing, in which the average transaction processing time increases because a new transaction from the queue is started before the previous one is completed.

Such a process can increase efficiency, but it can also lead to imbalance, because blocking times and delays are not determined in advance. This is exacerbated further if the target system does not operate on rigidly defined logic but follows fluctuating user behavior.

As far as RTS is concerned, the idempotency model is often used here instead of traditional transactions. It is based on the principle that repeating an operation with the same input data does not change the state of the system: the result is the same even if the operation is repeated or executed several times.

In RTS, events may be handled several times because messages can be duplicated, lost, or re-delivered. In such cases, the idempotency model guarantees that events are still processed correctly.
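A minimal sketch of an idempotent handler (the event shape and dedup-by-id scheme are illustrative): replaying the same event id leaves the state unchanged.

```python
processed: set = set()   # ids of events that have already been applied
balance = 0

def apply_event(event: dict) -> None:
    """Idempotent handler: re-applying the same event id changes nothing."""
    global balance
    if event["id"] in processed:   # duplicate or re-delivered message
        return
    processed.add(event["id"])
    balance += event["amount"]

deposit = {"id": "evt-42", "amount": 25}
apply_event(deposit)
apply_event(deposit)  # re-delivery: the state is unchanged
```

Because the handler records which event ids it has seen, at-least-once delivery behaves like exactly-once from the state's point of view, which is why the model suits RTS messaging.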

This article does not end there but rather seamlessly moves into part two, where we will discuss distributed computing, the limitations of distributed systems, a single data format, basic data operations, and features of CRUD data operations in distributed systems.
