
Designing High-Load System Architectures 2: Key Considerations and Best Practices

In the first article of this series, Designing High-Load System Architectures: Key Considerations and Best Practices, we talked about different kinds of data, the ways they are exchanged, and transactions. Here we will cover the limitations of distributed systems and the basic operations on data in them.

Distribution

In distributed control systems, many nodes run transactions and take part in processing them. For such a system to remain coherent, the load must be planned with continuous load balancing.

Two possible situations need to be considered here:

  • Nodes processing transactions do not affect each other's computations. Such a system is easy to balance.
  • Nodes depend on each other algorithmically. In this case, computations executed as cascading transactions can use the data of nodes that themselves participate in transaction processing. Such a system is much harder to balance, because an unpredictable total load may build up along the way.
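For the first, easy case - independent nodes - balancing can be as simple as always handing the next transaction to the least-loaded node. A minimal sketch, where the per-transaction cost model and all names are illustrative assumptions; cascading transactions would invalidate the per-node totals this relies on:

```python
import heapq

def balance(transactions, node_count):
    """Assign independent transactions to the currently least-loaded node.

    Valid only when nodes do not affect each other's computations.
    `transactions` is a list of (tx_id, estimated_cost) pairs.
    """
    # Min-heap of (accumulated_cost, node_id).
    heap = [(0, n) for n in range(node_count)]
    heapq.heapify(heap)
    assignment = {}
    for tx_id, cost in transactions:
        load, node = heapq.heappop(heap)   # cheapest node so far
        assignment[tx_id] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment

print(balance([("t1", 5), ("t2", 3), ("t3", 2)], 2))
# → {'t1': 0, 't2': 1, 't3': 1}
```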


One of the tools for tackling this problem is the database management system, each kind with its own advantages and disadvantages:

  • Relational databases have dedicated mechanisms for processing cascading and distributed transactions. Reasonable computational efficiency is achieved by tuning the DBMS configuration and the structures of the processed data. But real-time (RT) processing is complicated: the system has to establish where to read from, where to write to, and how to find, process, and save the index. As a result, such databases are slower than specialized systems.
  • And the rest :) These are usually NoSQL and NewSQL systems, with their efficient data-processing techniques and balancing of distributed interactions. They come to the rescue of high-load and real-time systems when speed is what matters.

But since no tool is perfect, they are usually combined. The end result is a single interoperable system that integrates effective database solutions of each type.

Limitations of distributed systems


Every distributed control system (DCS) is bound by the CAP theorem on distributed computing, according to which a computation can satisfy only two of the following three requirements:

  • Consistency, which removes contradictions between data when computations run simultaneously on all nodes; put simply, different nodes always hold the same data;
  • Availability, which means every request to the distributed system receives a correct response. The responses of all nodes do not necessarily have to coincide;
  • Partition tolerance, which keeps the response from each isolated section of the distributed system relevant even when the system partially fails.

In typical DCSs, the theorem plays out in three combinations:

  • consistency and always-available nodes, but no partition tolerance;
  • partition tolerance and available nodes, but no consistency;
  • partition tolerance and consistency, but no constant availability of all nodes.
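The trade-off in the last two combinations can be illustrated with a toy replica that, once cut off from its peers, either keeps answering with possibly stale data (sacrificing consistency) or refuses to answer (sacrificing availability). The `Replica` class and its interface are invented purely for this example:

```python
class Replica:
    """Toy illustration of the CAP trade-off during a network partition.

    `mode` picks what is sacrificed when the node loses its peers:
    "AP" keeps answering with possibly stale data, "CP" refuses
    to answer rather than risk returning inconsistent data.
    """
    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP"
        self.value = None
        self.partitioned = False  # True once the node is cut off

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency chosen over availability.
            raise TimeoutError("refusing a possibly stale read")
        return self.value         # may be stale in "AP" mode

ap = Replica("AP")
ap.value = 42
ap.partitioned = True
print(ap.read())  # the AP node still answers: 42
```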

Schematically and theoretically, this whole mess looks like this:

The CAP theorem (diagram)

But in practice, tasks are often set so that everything has to be complicated by mixing a real-time system, a high-load system, and a distributed control system all at once. Here are the variants we end up meeting:

  • a distributed high-load system,
  • a real-time high-load system,
  • a real-time DCS,
  • a distributed real-time high-load system - already a rarer centaur of a beast, whose abbreviated name somehow even recalls crazy Soviet-era abbreviations, or this one :)
Taumata hill in New Zealand, with the sign bearing the longest Maori place name (photo)

As a result, we most often get the following matryoshka: the real-time system is part of the high-load system, and the high-load system is part of the DCS. On the diagram it all looks like this:

How the system types nest (diagram)

In addition, concessions and permissible trade-offs help to reconcile the irreconcilable. As a rule, an experienced architect relaxes strict fulfillment of some of the CAP theorem's requirements, and is saved by the variety of possible consistency states. These can be:

  • weak, where the system may temporarily guarantee neither consistency nor availability, for example because of time delays or conflicting differences between data replicas;
  • strict, where the system explicitly selects one of the two basic properties, consistency or availability, and seeks to guarantee it without compromise, while not guaranteeing complete agreement of data between different nodes or parts of the system;
  • eventual (long-term), where the system allows consistency or availability to be temporarily violated in order to avoid failures, gain performance, and improve overall availability. In this state the system resolves temporary conflicts between copies of the data, letting nodes operate autonomously, not wait for responses from other nodes, and serve data locally;
  • intermediate, where the system sits in a state that is strictly neither consistent nor available and may violate both properties for a period of time, typically because of delays, failures, or network partitioning.
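The eventual (long-term) state is often resolved with a last-write-wins merge: once nodes reconnect, each key keeps the value with the newest timestamp. A hedged sketch; the `{key: (timestamp, value)}` replica layout is an assumption of this example, not a prescription:

```python
def lww_merge(replica_a, replica_b):
    """Last-write-wins merge of two diverged replicas.

    For each key, keep the value with the newer timestamp - one common
    way to resolve the temporary conflicts that eventual consistency
    permits while nodes operate autonomously.
    """
    merged = dict(replica_a)
    for key, (ts, val) in replica_b.items():
        if key not in merged or merged[key][0] < ts:
            merged[key] = (ts, val)
    return merged

a = {"x": (1, "old"), "y": (5, "kept")}
b = {"x": (3, "new")}
print(lww_merge(a, b))  # → {'x': (3, 'new'), 'y': (5, 'kept')}
```

Note that last-write-wins silently discards the older value; for critical data a system would use a stricter update mode instead.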

These properties let us solve a wide range of tasks. For example, a distributed control system can be built into an automated process-control loop. However, it is worth bearing in mind that two things are strictly forbidden in this case:

  • node separation, otherwise an important subsystem may go out of control,
  • data inconsistency, otherwise the system operators will receive different data, which will make their work uncoordinated, and eventually a disaster may occur.

A brief loss of communication between nodes is acceptable here. Moreover, the system remains intact for a short period even if an integral system or subsystem becomes separated. The time a subsystem may run autonomously is set by the relevant regulations; once that interval expires, the system should, ideally, be stopped and treated as failed. In practice, though, such a shutdown can be excessively costly or even dangerous - as with the nuclear reactor of a power plant, where a shutdown costs hundreds of thousands of dollars.


The opposing principles are well illustrated by contrasting industrial automation with banking systems. Payroll days heat up the atmosphere in banks: there are a lot of transactions, and they all have to reach a substantial number of recipients almost simultaneously. But there is a "but": only one payroll notification is sent, and it may not reach the client, or may arrive with some lag. The client may be upset or worried, but the worry passes once the person opens the app and sees that everything has arrived and all is fine. No particular tragedy, then - at most, some spent nerves of a bank manager.

Uniform data format

A wide variety of data formats inevitably leads to time lost on unnecessary transformations, which significantly slows the system down. Ideally, a single formalized protocol should be used for communication between all nodes in the DCS. A maximally simple data structure also helps, and sticking to the chosen protocol keeps processing, reading, writing, and rewriting fast and efficient.

In practice, the data structure of a high-load system is in most cases fixed; only the values of the variables change.

In most cases, the configuration of the data structure is placed in the metadata, so the data exchanged between nodes can be stripped of it; specialized data is a possible exception. It is also customary to move into the metadata everything that can be excluded from inter-node traffic during operation under load. As a result, all the complex relationships that identify hierarchy, connectivity, and structural affiliation end up in the metadata.


A data structure configuration is typically loaded in two ways:

  • from a file,
  • from the DBMS at node startup or in a minimum-load mode.
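Putting the two ideas together - structure fixed in the metadata, only values on the wire - might look like the minimal sketch below. The field names, the little-endian layout, and the schema format are illustrative assumptions:

```python
import struct

# The schema is metadata, loaded once at node startup (from a file or
# the DBMS); only the packed variable values travel between nodes.
SCHEMA = [("temperature", "f"), ("pressure", "f"), ("valve_open", "?")]
FMT = "<" + "".join(code for _, code in SCHEMA)  # e.g. "<ff?"

def encode(values):
    """Pack values in schema order; no field names or types on the wire."""
    return struct.pack(FMT, *(values[name] for name, _ in SCHEMA))

def decode(payload):
    """Recover named values using the shared schema metadata."""
    return dict(zip((name for name, _ in SCHEMA), struct.unpack(FMT, payload)))

msg = encode({"temperature": 21.5, "pressure": 1.0, "valve_open": True})
print(len(msg), decode(msg))  # 9 bytes instead of a self-describing payload
```

Because both sides share the schema, the payload stays a few bytes long, which is exactly the saving a uniform format buys under load.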

Basic data operations

In both high-load and real-time systems, basic database functions (CRUD) are usually organized by importance:

  • Read and update operations have higher priority.
  • Create and delete operations have lower priority.

But that's not all.

Keep in mind that dynamic memory is allocated during creation and released during deletion. Both processes are inherently costly and therefore should not be performed in real time.

An even more important point: while the create operation itself is safe, the same cannot be said of delete. Here there is a useful recommendation: in real-time systems, avoid physical deletion. Instead, mark the structure as deleted, and return to physical deletion either during maintenance or when the system has been unloaded for a long time.
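A minimal sketch of this soft-delete recommendation (the class and method names are invented for the example): delete only sets a marker, reads skip marked records, and physical removal waits for a maintenance pass.

```python
import time

class Record:
    def __init__(self, data):
        self.data = data
        self.deleted_at = None  # None means "live"

class Store:
    """Logical (soft) deletion: delete marks the record, reads skip it,
    and physical removal is deferred to a maintenance window."""
    def __init__(self):
        self.records = {}

    def delete(self, key):
        # Cheap marker write - safe to do in real time.
        self.records[key].deleted_at = time.time()

    def get(self, key):
        rec = self.records.get(key)
        return rec.data if rec and rec.deleted_at is None else None

    def purge(self):
        """Physical deletion; run during maintenance or low load."""
        self.records = {k: r for k, r in self.records.items()
                        if r.deleted_at is None}
```

As a bonus, a marked record can be un-deleted before the purge runs, which a physical delete never allows.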

As a general rule, no dynamic data operations should be used in a high-load system if they have the potential to compromise data integrity.

Peculiarities of CRUD operations with data in distributed systems

CRUD operations can be considered distributed only when there are distributed nodes - at least two in the system, and preferably more. The distributed operations themselves are usually subject to the following conditions:

  • the required efficiency,
  • minimal delays,
  • in some cases, support for parallelism.

Create operation

The create operation results in as many copies of the entity as the target system's configuration specifies.
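Under the stated assumption that the configuration fixes a replication factor, a distributed create might be sketched as follows; node objects are modeled as plain dicts purely for illustration:

```python
def create(entity, nodes, replication_factor):
    """Create the entity on as many nodes as the configured
    replication factor requires; returns the number of copies made.

    `entity` is a dict with an "id" field; `nodes` stand in for
    remote stores - here, plain dicts keyed by entity id.
    """
    targets = nodes[:replication_factor]
    for node in targets:
        node[entity["id"]] = dict(entity)  # independent copy per node
    return len(targets)

nodes = [{}, {}, {}]
made = create({"id": "e1", "value": 7}, nodes, replication_factor=2)
print(made)  # → 2; the third node holds no copy
```

A real system would also pick *which* nodes receive copies (by locality or load), but the invariant is the same: the configuration, not the caller, decides how many copies appear.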

Delete operation

Works similarly to the create operation.

Read operation

The read operation reads data from the node closest to the source of the request. If no request arrives for a long time, the read cell may stop being updated, and in some cases may not even be filled when a new request comes in.

This creates a dilemma: refresh the cell on request (and if so, how often), or always keep it up to date? Which option is more beneficial is for each team to decide, based on its specific priorities and capabilities.
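The first option - refresh on request, within a freshness window - can be sketched as below. `ReadCell`, its `ttl`, and the `fetch` callback are assumptions of the example; the alternative policy (refreshing continuously in the background) trades steady traffic for zero request latency.

```python
import time

class ReadCell:
    """Read-through cell refreshed on demand.

    A read finding the cached value older than `ttl` triggers a fetch
    from the nearest node; otherwise the cached value is returned.
    The cell starts unfilled, so the very first read always fetches.
    """
    def __init__(self, fetch, ttl):
        self.fetch = fetch            # callable reading the nearest node
        self.ttl = ttl
        self.value = None
        self.stamp = float("-inf")    # "never refreshed"

    def read(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.stamp > self.ttl:
            self.value = self.fetch()  # stale or unfilled: refresh
            self.stamp = now
        return self.value
```

Passing `now` explicitly keeps the sketch testable; in production the monotonic clock is used directly.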


Update operation

The area of responsibility of the update operation, if one may put it that way, is keeping the data in a memory cell up to date. Relevance is not static: it carries a priority that depends on the importance of the variable being processed. In a real-time system, it is considered good practice to divide such operations into three categories:

  • guaranteed (strict update), which actually guarantees that changes propagate to all distributed target nodes;
  • non-guaranteed (fast update), which has no operation confirmation and does not guarantee delivery to the target nodes. However, it is the fastest way to modify a data cell in a distributed manner, and can be used for variables whose lost values are compensated by the next update cycle;
  • guaranteed with confirmation (hard update), the most reliable way to propagate changes. Its drawback is the slowest execution due to the highest resource consumption, so it is usually advised to minimize its use. It is, however, indispensable for variables carrying critical data - a good example is passing a control command.
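The three categories might be sketched as follows. The node interface (`send`, `send_reliable`, `ack`) is an illustrative assumption, with a fake in-memory node standing in for real transport:

```python
import enum

class UpdateMode(enum.Enum):
    FAST = "fast"      # no confirmation, delivery not guaranteed
    STRICT = "strict"  # delivery to every target node is guaranteed
    HARD = "hard"      # delivery plus per-node confirmation

class FakeNode:
    """Stand-in for a remote node; real transport is out of scope."""
    def __init__(self):
        self.store = {}
    def send(self, key, value):           # best effort, may be lost
        self.store[key] = value
    def send_reliable(self, key, value):  # retried until delivered
        self.store[key] = value
    def ack(self, key):                   # confirm the value landed
        return key in self.store

def update(nodes, key, value, mode):
    """Dispatch an update in one of the three categories above."""
    if mode is UpdateMode.FAST:
        for n in nodes:
            try:
                n.send(key, value)   # losses fixed by the next cycle
            except OSError:
                pass
        return None                  # nothing to report
    for n in nodes:
        n.send_reliable(key, value)  # STRICT and HARD both deliver
    if mode is UpdateMode.HARD:
        # Slowest and most expensive, but confirmed everywhere -
        # the mode for critical data such as control commands.
        return all(n.ack(key) for n in nodes)
    return None
```

In practice the choice of mode is made per variable, matching its importance, which is exactly the prioritization the list above describes.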

The whole variety of update operations helps keep changing data up to date, synchronizing changes effectively and providing system redundancy.

This concludes our brief overview of the microcosm of high-load systems; in the next articles of the series we will talk about them in more detail.
