Key Factors in Designing Hardware Architecture for High-Load Systems, evolution in hardware architecture design, illustration

Key Factors in Designing Hardware Architecture for High-Load Systems (HLS)

Should a business decide to build its own IT infrastructure, its technicians face a thorny path, because a high load system has to be built on a foundation of hardware infrastructure. That foundation is hardly a whim: it is a significant factor in the efficiency, reliability and final cost of the entire undertaking. If a business or enterprise instead chooses SaaS/IaaS for its solution, the IT department’s load gets significantly lighter. In that case almost everything, including the problem of building out the hardware part of the HLS, falls to the service provider, and all that’s left for the client is essentially to use the cloud services.

However, due to various circumstances, a number of companies still choose to build their own infrastructure. Besides the purely on-premises option, there is a more flexible and popular one - transitioning to a hybrid infrastructure. Either way, the work starts with the hardware, and that is where our tale of high load systems begins.

There are several important components to be taken into account when designing hardware infrastructure:

  • The location of the equipment - that is, the geography of the data center/alternative premises, taking into account its technical parameters, environmental conditions, security requirements and user tasks;
  • The type of hardware to be selected, taking into account the wishes of the business, technical requirements and maintenance conditions;
  • An assessment of the physical equipment’s configuration, taking into account projected load dynamics and peaks, which helps avoid, or at least reduce, “over-” or “underkill” when purchasing hardware;
  • Dependence on vendor or technology, where it’s important not to go overboard with mixing and matching, but also not to get locked into a single vendor that may suddenly discontinue a product line;
  • Considerations for future scalability, i.e., thinking directly about how the business or production will evolve, and what resources will be needed at what point;
  • Managing risks as they really are, including forecasting the consequences of decisions made, working through options to reduce those risks and creating an action plan for when a risk actually materializes.

Let’s look at these in more detail below.

Equipment location

The choice of where to place selected equipment is not a random one, but rather dictated by correctly set requirements from both the business and the IT department.

High load systems (HLS), Equipment location, illustration

In order to avoid mistakes, the following criteria are usually taken into account when designing an HLS:

  • Data loss tolerance and allowable data recovery time (RPO/RTO) metrics and disaster protection requirements, which directly affect the geography of distributed clusters;
  • Security and safety requirements, remoteness, redundancy of communication channels and the availability of a suitable site, all of which shape the choice of location - internal placement on the company’s own premises or external placement in another organization’s commercial data center;
  • At the most basic level, where the main pool of the infrastructure’s users will be. For example, if the main audience is in Europe, placing the hardware in Australia is hardly a good idea.

These criteria, in turn, must be considered as part of two more components:

  • The low-level part, i.e., the location and physical equipment;
  • The high-level part, i.e., the software architecture.

Ideally, high performance should be provided at both levels. In practice, however, the stronger and more productive level can complement and balance the other. So, if a company’s data center or facility doesn’t meet the highest requirements, its shortcomings can be compensated for with an efficient software architecture, ensuring compliance with availability requirements at the software level.

But there is no one-size-fits-all solution here. It can vary greatly depending on business challenges and industries. Even the same problem can be solved in different ways.
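
To make the RPO/RTO criterion mentioned above a little more tangible, here is a minimal sketch, with Python used purely for illustration, of the kind of back-of-the-envelope check one might run for a candidate placement and replication scheme. The function name and all the numbers are hypothetical, not recommendations.

    # Minimal sketch: can a candidate placement / replication scheme meet the
    # agreed RPO/RTO targets? All names and numbers are hypothetical.

    def meets_rpo_rto(replication_interval_min: float,
                      failover_time_min: float,
                      rpo_min: float,
                      rto_min: float) -> bool:
        """Worst-case data loss equals the replication interval (RPO);
        worst-case recovery time equals the failover time (RTO)."""
        return replication_interval_min <= rpo_min and failover_time_min <= rto_min

    # Asynchronous replication to a remote site every 15 minutes,
    # with a 30-minute failover procedure:
    print(meets_rpo_rto(15, 30, rpo_min=60, rto_min=60))  # True: fits a 1-hour RPO/RTO
    print(meets_rpo_rto(15, 30, rpo_min=5, rto_min=15))   # False: calls for synchronous replication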

The type of hardware to be selected

At the start of the design process of an HLS, when you need to choose a vendor and specific equipment configurations for long-term operation, there are two common traps to fall into. The first is buying a small amount of the most expensive, most powerful, latest-generation hardware with increased reliability and redundancy in every node. The second is trying to save money by acquiring a lot of low-efficiency equipment without redundancy and with a low MTBF (Mean Time Between Failures), but at a low cost, hoping to make up for quality with quantity.

High load systems (HLS), type of hardware, illustration

In practice, how can one sail between the Scylla of expensive, latest-generation hardware and the Charybdis of mountains of cheap junk? There is a way to design around this, and it consists of a minimum set of required actions. Here they are:

  • Discuss with the business the service availability metrics (SLA) that the infrastructure must deliver (a rough downtime-budget calculation is sketched after this list);
  • Clearly define the technical requirements - specifications - for the hardware component of your infrastructure;
  • Define the geographic location, location conditions, and related requirements they impose;
  • Understand whether there are enough technicians to service the equipment in the chosen location, and what the cost of their work will be;
  • Predict the likely dynamics and surges of peak loads so as not to end up in downtime during a hypothetical influx of customers;
  • Consider further scaling - both horizontal and vertical.
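
As promised above, here is a minimal sketch, using Python purely for illustration, of how an SLA availability target translates into an allowed yearly downtime budget - often the first number worth agreeing on with the business. The percentages are only examples.

    # Minimal sketch: turning an SLA availability target into a yearly
    # downtime budget. The percentages below are only examples.

    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_budget_minutes(availability_pct: float) -> float:
        """Allowed downtime per year for a given availability percentage."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% availability -> {downtime_budget_minutes(sla):.0f} min/year")
    # 99.0%  -> ~5256 min/year (about 3.7 days)
    # 99.9%  -> ~526 min/year  (about 8.8 hours)
    # 99.99% -> ~53 min/year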

In addition, to steer clear of both extravagance and mass-produced junk, it is necessary to also consider the future software component at the initial stage, as it may offset some of the shortcomings of the hardware. To take a simple example, you can have one super-resilient server, or you can have two less advanced servers and set up load balancing or an active-passive cluster, which may even give you higher availability in the end.
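
For the example just given, a rough comparison of the two options might look like the sketch below. The availability figures are purely hypothetical, and the model ignores failover time and shared failure domains (power, network, the balancer itself), so treat it as an illustration rather than a sizing tool.

    # Minimal sketch: one highly resilient server vs. two cheaper servers in an
    # active-passive pair. Availability figures are hypothetical; failover time
    # and shared failure domains are ignored here.

    def pair_availability(node_availability: float) -> float:
        """An active-passive pair is unavailable only when both nodes are down
        (assuming independent failures and instant failover)."""
        return 1 - (1 - node_availability) ** 2

    print(f"single premium server:     {0.999:.4%}")
    print(f"two 99% servers, A/P pair: {pair_availability(0.99):.4%}")
    # 99.9000% vs 99.9900% - the cheaper pair can come out ahead on paper.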

Given well-thought-out answers to these questions, it often turns out that the number of options is not as large as it first seemed - it varies from a few to just one.

Dependence on vendor or technology

Dependence on a single vendor, whether for hardware or software, is not the most secure position to be in when designing infrastructure in an era of change: markets are reshuffled as companies take each other over, old brands fade, and businesses change course for political reasons or simply switch to new lines of hardware or software and abandon the old ones.

High load systems (HLS), Dependence on vendor or technology, illustration

One can find plenty of telling examples.

For example, 3dfx Interactive, a developer and manufacturer of 3D graphics processors and graphics cards, stumbled with its cards based on the VSA-100 while betting on the development of a new chipset called Rampage. As a result, the company was forced to sell its assets to Nvidia, which, in turn, issued a release announcing that it was discontinuing support for 3dfx products. All that customers were left with was a narrow window to exchange their 3dfx equipment for Nvidia products. Those who couldn’t do it, or ran out of time, were left without support and without a replacement.

On the other hand, too motley a mix of hardware and software creates headaches when you try to combine it all while minimizing lags and bugs. And again there is the question of technical support for things that are no longer supported by the vendor (EOL/EOS, or End of Life/End of Support) - you get the idea :)

An exception to this may be large integrators and service providers who often develop their own software and server solutions, taking into account security and specific business tasks, and control all the processes themselves, from architecture to production. Classic examples are Google, AWS, Meta, etc.

In the end, achieving the ideal is difficult, and calculating 100% of the risks is impossible, but this does not negate the need for planning. At one point or another, it can offer you a lifeline.

Evaluation of hardware configuration under load

An accurate understanding of the technical characteristics of the required equipment allows you to design a flexible infrastructure that meets the objectives of the business or enterprise and takes into account future development. In the absence of a pragmatic approach, two unpleasant events can follow:

  • "Underkill" or when the equipment/its capacity/volume of storage proves insufficient under sharply increased loads. The likely outcome is lost profits and reputational damage. It's always a shame when a site goes down after an influx of customers due to a costly advertising campaign.
  • "Overkill" or equipment downtime even at maximum load. At this point we’re dealing with the issue of TCO (total cost of ownership). In this case you need to consider optimization of physical hardware and software: disabling unnecessary equipment, setting up additional services, etc. Although of course this situation is less common :).

To avoid running around like a headless chicken when these mishaps occur, it is worth taking the following indicators into account when configuring equipment:

  • configuration standardization,
  • performance and utilization at the planned load,
  • single and double failure domain sizes,
  • performance and utilization at peak load (see the sketch after this list),
  • redundancy level,
  • once again, the availability of equipment and components on the market and vendor announcements of future production plans,
  • availability of the necessary specialists in the local labor market.
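
As a sketch of how several of these indicators fit together, here is a hypothetical check of whether a given pool of servers survives peak load with one node lost (N+1 redundancy) while staying under a target utilization ceiling. All capacities and load figures are invented for illustration.

    # Minimal sketch combining several indicators above: does a server pool
    # survive peak load with one node lost (N+1), without exceeding a target
    # utilization ceiling? All capacities and loads are invented.

    def survives_peak(n_servers: int,
                      per_server_capacity_rps: float,
                      peak_load_rps: float,
                      failed_nodes: int = 1,
                      max_utilization: float = 0.8) -> bool:
        """Check that the remaining nodes absorb the peak while staying
        below the utilization ceiling."""
        remaining_capacity = (n_servers - failed_nodes) * per_server_capacity_rps
        return peak_load_rps <= remaining_capacity * max_utilization

    # 5 servers of 1,000 RPS each, a 3,000 RPS peak, one node down:
    print(survives_peak(5, 1000, 3000))  # True:  3000 <= 4 * 1000 * 0.8
    print(survives_peak(4, 1000, 3000))  # False: 3000 >  3 * 1000 * 0.8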

Scaling for future development

A plan prepared in advance is key. Otherwise, you may be faced with the fact that the software infrastructure has reached its limit on the number of nodes, and a complete reorganization of the hardware infrastructure is required for efficient operation or any normal operation at all.

During planning, you can consider not only increasing the capacity of your own equipment (adding servers when scaling horizontally, or simply installing additional RAM, processors and drives, i.e., increasing capacity when scaling vertically), but also, if there are no strict constraints against it, building a hybrid infrastructure together with a cloud provider, which lets you quickly increase and decrease capacity - for example, during a seasonal influx of customers.
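
As a rough illustration of that hybrid approach, here is a small sketch that keeps the baseline on owned hardware and estimates how many cloud instances the overflow would require during a peak. The function, the thresholds and the per-instance capacity are all hypothetical.

    # Minimal sketch of the hybrid approach: keep the baseline on owned
    # hardware and burst the overflow to a cloud provider. The function,
    # thresholds and per-instance capacity are illustrative only.

    import math

    def plan_capacity(forecast_rps: float,
                      on_prem_capacity_rps: float,
                      cloud_instance_rps: float) -> dict:
        """Split a forecast load between on-prem capacity and cloud instances."""
        overflow = max(0.0, forecast_rps - on_prem_capacity_rps)
        return {
            "on_prem_rps": min(forecast_rps, on_prem_capacity_rps),
            "cloud_instances": math.ceil(overflow / cloud_instance_rps),
        }

    # Seasonal peak: 12,000 RPS forecast, 8,000 RPS of own hardware,
    # roughly 500 RPS per cloud instance:
    print(plan_capacity(12_000, 8_000, 500))  # {'on_prem_rps': 8000, 'cloud_instances': 8}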

In addition to the physical hardware, it is worth keeping in mind the availability (or absence) of licenses for the required software in the required quantity, their expiration dates, the need for renewals - and what will happen if renewal turns out to be impossible.

Risk management as the final stage

Everything that was written above can be described as part of the risk management process. And with all of this already in place, you can now move on to a comprehensive assessment of possible risks.

For each particular choice, it is worth assessing the risk of that decision and thinking through ways to reduce it - or accepting it, but with full awareness and responsibility.

For example, for some, it does not make sense to build a disaster-resistant geographically distributed infrastructure (distributed across different countries, or even continents) - the cost is simply not justified. At the same time, another business or enterprise might be ready to shoulder any expenses for the sake of stability and reliability in order to avoid much costlier consequences.

Risk management as the final stage, High load systems (HLS), illustration

There are no one-size-fits-all solutions. You have to find your own individual path in each case.
