Fault tolerance is the ability of an entity to keep operating when its software or hardware malfunctions. In practice, it describes how well equipped a company is to maintain a minimal operational state in such an event.

Understanding Fault Tolerance and Impact Management

What is Fault Tolerance?

Fault tolerance is an important concept for businesses that rely heavily on IT hardware or software. Its underlying premise is that IT equipment can fail for a host of reasons.

One key point is that when a hardware or software failure occurs, an entity typically cannot maintain full functionality. Nor does fault tolerance require it, since preserving full functionality through every failure is a very costly proposition. The goal is to keep the essential services running.

Designing a Strategy

The most basic step towards developing fault tolerance at the entity level is planning for it. The planning process starts with separating organizational processes into business-critical ones and lower-priority ones.

Once an entity has defined its business-critical processes, they act as the blueprint around which the fault tolerance mechanism is designed and deployed. With the planning phase complete, the entity then considers its existing platform.
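The separation described above can be pictured with a small Python sketch. It is only an illustration: the `Process` class, the `split_by_priority` function and the sample inventory are hypothetical names made up for this example, and a real classification would rest on a business impact review rather than a single flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """A hypothetical record for one organizational process."""
    name: str
    business_critical: bool


def split_by_priority(processes):
    """Return (critical, lower_priority) lists, preserving input order."""
    critical = [p for p in processes if p.business_critical]
    lower = [p for p in processes if not p.business_critical]
    return critical, lower


# Made-up inventory for an online store (an assumption for illustration).
inventory = [
    Process("checkout", True),
    Process("order database", True),
    Process("marketing newsletter", False),
]

critical, lower = split_by_priority(inventory)
print([p.name for p in critical])  # the blueprint for the fault tolerance design
print([p.name for p in lower])     # processes that can wait during an outage
```

The list of critical processes is what the rest of the design is built around. Everything in the second list can be restored later.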

Cloud vs. On-Premises Architecture

How resources are allocated to the fault tolerance mechanism depends first on the type of infrastructure the entity already uses. This can be broadly classified as either cloud-based or on-premises.

Where to Deploy Fault Tolerance Resources?

Once an entity has determined the characteristics and limitations of its existing hardware, it needs to answer the next critical question: where will the fault tolerance mechanism be deployed?

The answer depends heavily on the company's existing hardware footprint. If the business-critical processes run on on-premises infrastructure, it is prudent to deploy the fault tolerance hardware on premises as well.

A strong reason for this approach is that integrating the fault tolerance solution with the primary hardware is easier when both reside in the same ecosystem. This results in quick response times and minimal integration challenges.

If an entity has outsourced its business-critical processes to a Cloud Service Provider (CSP), it is prudent to deploy the fault tolerance mechanism in the cloud as well. Integration challenges aside, it is better to use a different CSP for the fault tolerance mechanism than for routine operations.

Using separate CSPs diversifies risk and makes the solution more robust. This is particularly helpful when the downtime originates from the CSP itself.

How to Achieve High Fault Tolerance?

As discussed earlier, a company has to identify its business-critical processes. Take the example of an e-commerce website. The two most critical components in this case are the web server and the back-end database, which work together to respond to visitor requests.

Once the critical components have been identified, the next major challenge is to maintain real-time, or as recent as possible, backups of the business-critical data and processes.

If the backup lags far behind the live system, restoring from it pays little to no dividends. It is essential that the latest backups of the business-critical data and processes are mounted on the fault tolerance hardware.
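One way to make the idea of backup lag concrete is to compare the age of the most recent backup against a tolerance chosen by the business. The Python sketch below assumes a hypothetical five-minute tolerance and made-up timestamps. The function names are illustrative and not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def backup_lag(last_backup_at, now):
    """Time elapsed between the last backup and the moment of failure."""
    return now - last_backup_at


def is_backup_usable(last_backup_at, now, max_lag):
    """True when the backup is recent enough to be worth restoring from."""
    return backup_lag(last_backup_at, now) <= max_lag


# Assumed values for illustration only.
now = datetime(2024, 1, 1, 12, 0, 0)
max_lag = timedelta(minutes=5)

fresh = datetime(2024, 1, 1, 11, 58, 0)  # taken 2 minutes before the failure
stale = datetime(2024, 1, 1, 9, 0, 0)    # taken 3 hours before the failure

print(is_backup_usable(fresh, now, max_lag))
print(is_backup_usable(stale, now, max_lag))
```

For the e-commerce example, a backup that is three hours old would lose every order placed in that window, which is why the acceptable lag should be set per process.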

Fault Isolation

Once the backup hardware has taken over and business-critical processes are being served from the alternate hardware, it is imperative to isolate the faulty hardware from the rest of the entity’s architecture.

This serves two key goals at once. Firstly, if the service went down as a result of a malicious cyber attack, the impact remains contained, and the affected hardware stays available for identification, analysis and troubleshooting.

Secondly, with the failed hardware or software isolated, work can start immediately on restoring the full functionality of the affected subcomponents. If the issue was caused by an external actor, the functioning segment of the infrastructure also needs to be checked for signs of compromise.
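The takeover and isolation steps can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `Node` and `FailoverRouter` are hypothetical classes, a failure is simulated by flipping a flag, and real deployments would rely on health checks and network-level isolation rather than an in-memory list.

```python
class Node:
    """A hypothetical server that either serves requests or is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.isolated = False

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"


class FailoverRouter:
    """Sends traffic to the active node and fails over to the standby once."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby
        self.quarantined = []  # failed nodes kept aside for analysis

    def handle(self, request):
        try:
            return self.active.handle(request)
        except RuntimeError:
            if self.standby is None:
                raise  # nothing left to fail over to
            # Promote the standby and take the failed node out of rotation.
            failed, self.active, self.standby = self.active, self.standby, None
            failed.isolated = True
            self.quarantined.append(failed)
            return self.active.handle(request)


primary = Node("primary")
standby = Node("standby")
router = FailoverRouter(primary, standby)

print(router.handle("GET /cart"))  # primary served GET /cart
primary.healthy = False            # simulate a hardware failure
print(router.handle("GET /cart"))  # standby served GET /cart
print(primary.isolated)            # True
```

Keeping the failed node in `quarantined` instead of discarding it mirrors the point above: the faulty component receives no traffic but remains available for troubleshooting.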

What is the Correct Fault Tolerance Level?

This is a highly subjective question, and for that reason it has no definitive answer. Each organization must realize that developing fault tolerance entails financial and administrative costs.

As a company increases its fault tolerance level, the accompanying costs also rise. So, the answer lies in finding the balance at which the cost of deploying a fault tolerance mechanism still pays the desired dividends.

When the benefits of increasing the fault tolerance level start to taper off, that is the point where a company needs to draw the line and stop spending additional valuable resources on this mechanism.
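The point where benefits taper off can be illustrated by comparing the extra cost of each fault tolerance level with the extra loss it is expected to prevent. The Python sketch below uses invented annual figures and a hypothetical `pick_level` function. Actual numbers would have to come from the organization's own cost and downtime estimates.

```python
def pick_level(levels):
    """Pick the highest level whose extra benefit still covers its extra cost.

    `levels` is a list of (name, annual_cost, annual_loss_avoided) tuples,
    ordered from the lowest to the highest fault tolerance level.
    Returns None if even the first level does not pay for itself.
    """
    chosen = None
    prev_cost = 0
    prev_benefit = 0
    for name, cost, benefit in levels:
        marginal_cost = cost - prev_cost
        marginal_benefit = benefit - prev_benefit
        if marginal_benefit < marginal_cost:
            break  # returns have tapered off; stop spending here
        chosen = name
        prev_cost, prev_benefit = cost, benefit
    return chosen


# Invented figures for illustration only.
levels = [
    ("nightly backups", 10_000, 60_000),
    ("warm standby", 40_000, 110_000),
    ("hot standby on a second provider", 120_000, 130_000),
]

print(pick_level(levels))  # warm standby
```

In this made-up case the third level costs an extra 80,000 per year but avoids only an extra 20,000 in losses, so the line is drawn at the second level.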

Evaluating Fault Tolerance

Remember, fault tolerance should not be treated as a one-time activity that never needs to be revisited once completed. It should be an ongoing process: as the operations of an entity change in nature or scope, the mechanism has to be realigned.

This realignment does not necessarily mean increasing the resource allocation. At times, an evaluation may reveal the need to reduce the fault tolerance footprint. So, evaluating fault tolerance should be a periodic exercise.

Conclusion

In today’s era of cutthroat competition and unrelenting customer expectations, a business simply cannot afford a total shutdown or unavailability of its services. A customer may tolerate inferior service momentarily, but no service at all is surely a red flag.

Therefore, it is imperative for organizations to plan, implement and periodically evaluate a robust fault tolerance mechanism. This goes a long way towards ensuring the long-term sustainability of an organization and retaining valuable customers.