Availability concepts
Load balancing: The process of distributing server or network load across multiple servers or networks. An example is a clustered solution in which each server in the pool shares the load according to the design parameters. More specifically, load balancing distributes incoming network traffic efficiently across a group of back-end servers, also known as a server farm or server pool. A load balancer acts as a "traffic regulator": it routes client requests across all eligible servers to make the best use of capacity and ensures that no single server is overloaded, which could degrade performance. If a server goes down, the load balancer redirects traffic to the remaining online servers. When a new server is added to the group, the load balancer automatically starts sending requests to it. Load balancing therefore improves network performance, reliability, and availability. Multilayer switches and DNS servers can serve as load balancers.
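The routing behavior described above can be illustrated with a minimal sketch. It assumes a simple round-robin policy, which is only one of several possible policies and is not named in these notes; the class name RoundRobinBalancer, its methods, and the server names are all hypothetical.

```python
class RoundRobinBalancer:
    """Hypothetical round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)      # every server in the pool, in order
        self.healthy = set(self.servers)  # servers currently eligible for traffic
        self._next = 0                    # index of the next server to try

    def mark_down(self, server):
        """Stop sending traffic to a failed server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Resume sending traffic to a recovered server."""
        self.healthy.add(server)

    def add_server(self, server):
        """Add a new server; it starts receiving requests automatically."""
        if server not in self.servers:
            self.servers.append(server)
        self.healthy.add(server)

    def route(self):
        """Return the next healthy server, or raise if none is available."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["app1", "app2", "app3"])
    print([lb.route() for _ in range(4)])  # requests rotate through the pool
    lb.mark_down("app2")                   # simulate a server failure
    print([lb.route() for _ in range(3)])  # app2 no longer receives traffic
    lb.add_server("app4")                  # new server joins the pool
    print([lb.route() for _ in range(4)])
```

Real load balancers usually detect failures through health checks rather than explicit mark_down calls; the manual calls here only keep the sketch short.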
High availability (HA): Refers to systems that are durable and intended to operate continuously without failure for a long time. HA servers or systems usually conform to strict environmental conditions and have built-in failover mechanisms. High availability is incorporated into the system design so that uptime is maintained to the designed standard under all circumstances. High availability is usually design-specific, whereas fault tolerance is device- or network-specific.
Power Management:
Power converter: Converts alternating current (AC) to direct current (DC). Power converters are also used to step voltage down from a higher level to a lower level.
Power inverter: Converts direct current (DC) to alternating current (AC). Power inverters are typically used to step voltage up from a lower level to a higher level.
Power redundancy: Duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe.
Uninterruptible Power Supply: Provides emergency power to a load when the input power source fails.
Mean Time to Repair (MTTR): The average time required to fix a failed component or device and return it to production status.
MTTR includes the time it takes to detect the failure, diagnose the problem, and repair it. It is a basic measure of how maintainable an organization's equipment is and, ultimately, a reflection of how efficiently the organization can fix a problem.
Mean Time Between Failures (MTBF): The most common failure-related metric, and also the one most often used incorrectly. MTBF is the time that elapses between one failure and the next. Mathematically, it is the sum of mean time to failure (MTTF) and MTTR: the total time required for a device to fail and for that failure to be repaired.
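As a worked example of the relationship MTBF = MTTF + MTTR, the sketch below computes the three metrics from a small incident log. The function name reliability_metrics and the sample figures are made up for illustration. The availability ratio MTTF / (MTTF + MTTR) is the standard steady-state formula and is an addition here, not something stated in these notes.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def reliability_metrics(uptime_hours, repair_hours):
    """Compute MTTF, MTTR, MTBF and availability from an incident log.

    uptime_hours: hours the device ran before each failure (illustrative data)
    repair_hours: hours needed to detect, diagnose and repair each failure
    """
    mttf = mean(uptime_hours)   # mean time to failure
    mttr = mean(repair_hours)   # mean time to repair
    mtbf = mttf + mttr          # one full fail-and-repair cycle
    availability = mttf / mtbf  # fraction of each cycle spent running
    return {
        "mttf_hours": mttf,
        "mttr_hours": mttr,
        "mtbf_hours": mtbf,
        "availability": availability,
    }


if __name__ == "__main__":
    # Three hypothetical failures: run time before each one, then its repair time.
    metrics = reliability_metrics([700, 500, 600], [2, 4, 3])
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

With these sample figures the device runs 600 hours on average and takes 3 hours to repair, so the MTBF is 603 hours.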
Recovery Time Objective (RTO): The maximum amount of time that a process or service is allowed to be down with the consequences still considered acceptable. Beyond this time, the break in business continuity is considered to affect the business negatively.
Recovery Point Objective (RPO): The maximum period of time over which transactions could be lost in a major incident; in other words, how much data you are willing to walk away from in order to get everything up and running again. RTO and RPO must be balanced when defining a policy for handling incidents.
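To make the two objectives concrete, the sketch below checks a single incident against them. It assumes the data-loss window is the time between the last good backup and the failure, and that downtime is the time between the failure and restoration. The function name evaluate_incident, the timestamps, and the objective values are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_incident(last_backup, failure, restored, rpo, rto):
    """Compare one incident against the RPO and RTO.

    data loss = everything written between the last backup and the failure
    downtime  = time between the failure and service restoration
    """
    data_loss = failure - last_backup
    downtime = restored - failure
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    result = evaluate_incident(
        last_backup=datetime(2024, 1, 1, 2, 0),  # nightly backup (illustrative)
        failure=datetime(2024, 1, 1, 9, 30),
        restored=datetime(2024, 1, 1, 11, 0),
        rpo=timedelta(hours=4),                  # hypothetical objectives
        rto=timedelta(hours=2),
    )
    print(result)
```

In this example the service is back within the RTO, but 7.5 hours of transactions are lost against a 4-hour RPO, which suggests backing up more often rather than restoring faster.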
SLA Requirements:
Service Level Agreement (SLA): An SLA is a document that describes the minimum performance criteria a provider promises to meet while delivering a service. It typically also sets out the remedial action and any penalties that take effect if performance falls below the promised standard. It is an essential component of the legal contract between a service consumer and the provider.
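One common performance criterion in an SLA is an availability percentage. As a rough illustration, the sketch below converts such a percentage into the downtime it permits over a billing period. The function names, the 30-day period, and the 99.9% target are assumptions for the example, not terms taken from any particular agreement.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Downtime budget, in minutes, implied by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def sla_met(actual_downtime_minutes, availability_percent, period_days=30):
    """True if the measured downtime stays within the budget."""
    budget = allowed_downtime_minutes(availability_percent, period_days)
    return actual_downtime_minutes <= budget


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)  # hypothetical 99.9% target
    print(f"Budget over 30 days: {budget:.1f} minutes")
    print("50 minutes of downtime within SLA?", sla_met(50, 99.9))
```

A 99.9% target over 30 days leaves a budget of roughly 43 minutes, so 50 minutes of downtime in that period would fall below the promised standard.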
Master License Agreement (MLA): A document created by a software company that defines how its software can be used.
Memorandum of Understanding (MOU): A document that expresses mutual accord on an issue between two or more parties. MOUs are generally recognized as binding, even if no legal claim can be based on the rights and obligations laid down in them.
Statement of Work (SOW): A formal document that captures and defines the work activities, deliverables, and timeline a vendor must execute in performance of specified work for a client.
Redundancy and high availability (HA) concepts
Types of backup sites:
Hot Site: A backup site that is up and running continuously. A hot site allows a company to resume normal business operations within a very short time after a disaster. It can be set up in a branch office, a data center, or even in the cloud. A hot site must be online, available immediately, and equipped with all the necessary hardware, software, network, and Internet connectivity. Data is regularly backed up or replicated to the hot site so that it can become fully operational in minimal time if a disaster strikes the original site. The hot site must be located far enough from the original site that the same disaster does not affect both.
Warm Site: A backup site that is not as fully equipped as a hot site. A warm site is configured with power, phone, network, and so on, and may have servers and other resources, but it is not ready for immediate switchover. Switching over from the disaster-affected site to a warm site takes longer than switching to a hot site; the attraction is lower cost.
Cold Site: A backup site with even fewer facilities than a warm site. A cold site takes longer than a warm or hot site to bring into operation, but it is the cheapest option. It may provide physical space, electricity, and air conditioning, but it can require days or even weeks to set up properly before operations can start.