Single NFS server to redundant NFS storage

Redundant NFS (in fact, any redundant storage) is not trivial.
Plan to spend a good amount of time (and capital) on this if you really want it to work well.

There are generally two options available to you:

Option 1: Buy redundant storage devices

This is the fastest (and usually most expensive) option. Pick a vendor who makes a storage device with redundancy features that meet your needs, give them the company credit card, and try not to get tears on the invoice.

The two major benefits of this route are that it’s fast (you get a pre-built solution you can just roll out by following the manual) and it’s supported (if you have a problem you call the vendor and yell until they fix it).

Option 2: Build it yourself

This site has a good outline of building a redundant iSCSI/NFS cluster using Debian Linux. It’s from 2009, but the principles are sound.
Specific step-by-step instructions on how to build this sort of environment are beyond the scope of Server Fault, but I can give you a rough outline of what you’ll need:

  • Shared (or replicated) storage: In order to have redundancy on your storage layer you need to have the same data accessible from multiple locations – either by replicating it in real time, or by connecting everything to a shared pool of disks. A SAN is the usual way to meet the shared storage requirement. This is still a single point of failure, but when you put all your eggs in one of these baskets you make sure it’s a VERY good basket.

    DRBD or ZFS replication can meet the requirement for replicated storage if you elect to go that route – it’s probably cheaper than a SAN, and both technologies have matured to a very reliable state. (A minimal sketch of snapshot-based replication appears after this list.)

  • Multiple “front-end” systems: Now that you have the storage worked out you need to make it accessible through redundant “front-end” systems – these are the machines that are running the NFS server (or whatever you use to serve up the disk to clients).

    You need at least two, running high-availability/failover software so that if/when you lose one the other can take over. IP failover is the “easy” option here (if one box goes down the other assumes the “live” IP address); a rough sketch of that idea also appears after this list.

  • Multiple physical paths to storage: All the storage redundancy in the world doesn’t help you if everything goes through one wire.

    You need to ensure that the client machines have multiple physical paths to get back to the storage front-ends, otherwise a failed switch leaves you with the same single-point-of-failure situation you’re trying to get out of.
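
To make the replicated-storage bullet a bit more concrete, here is a minimal sketch of snapshot-based replication. It assumes a ZFS dataset named tank/nfs on the primary and a standby host reachable over passwordless SSH as standby (both names are placeholders). In practice you would more likely run DRBD, zrepl, or syncoid, which handle scheduling, retention, and failure recovery for you; this only shows the core send/receive loop.

```python
#!/usr/bin/env python3
"""Minimal ZFS snapshot-replication sketch (illustration only)."""
import subprocess
import time
from typing import Optional

DATASET = "tank/nfs"   # dataset exported over NFS (placeholder name)
STANDBY = "standby"    # replication target, reachable over passwordless SSH

def replicate(prev_snap: Optional[str]) -> str:
    """Take a new snapshot and send it to the standby (incrementally if possible)."""
    snap = f"{DATASET}@repl-{int(time.time())}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    send_cmd = ["zfs", "send"]
    if prev_snap:
        send_cmd += ["-i", prev_snap]   # incremental stream since the previous snapshot
    send_cmd.append(snap)

    # Pipe 'zfs send' on the primary into 'zfs receive' on the standby.
    # 'receive -F' rolls the standby copy back to match, which is what you
    # want for a purely passive replica.
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(["ssh", STANDBY, "zfs", "receive", "-F", DATASET],
                   stdin=sender.stdout, check=True)
    sender.wait()
    return snap

if __name__ == "__main__":
    last: Optional[str] = None
    while True:
        last = replicate(last)
        time.sleep(60)                  # replicate roughly once a minute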
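
For the front-end bullet, the sketch below shows the bare idea behind IP failover: each node pings its peer and, if the peer stops responding, claims the shared service address that clients mount. The addresses and interface name are placeholders, and a real cluster would use keepalived (VRRP) or Pacemaker/Corosync, which also handle fencing and split-brain; don’t run something this naive in production.

```python
#!/usr/bin/env python3
"""Naive IP-failover monitor (illustration only)."""
import subprocess
import time

PEER_IP = "192.0.2.11"        # the other front-end node (example address)
SERVICE_IP = "192.0.2.10/24"  # floating IP that NFS clients mount from
INTERFACE = "eth0"            # interface carrying the service IP (assumed)

def peer_alive() -> bool:
    """True if the peer answers at least one of three pings."""
    return subprocess.run(
        ["ping", "-c", "3", "-W", "1", PEER_IP],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

def take_over() -> None:
    """Bring up the floating IP locally and announce it on the network."""
    subprocess.run(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE],
                   check=False)
    # Gratuitous ARP so clients and switches learn the new MAC for the IP.
    subprocess.run(["arping", "-c", "3", "-U", "-I", INTERFACE,
                    SERVICE_IP.split("/")[0]], check=False)

if __name__ == "__main__":
    while True:
        if not peer_alive():
            take_over()
        time.sleep(5)
```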

Building your own redundant storage usually takes longer than a vendor solution, and you’re supporting it yourself (which means you need to be comfortable with the technology involved).
The major advantages are cost (you can often build the environment cheaper than vendor-provided solutions) and flexibility (you can tailor the solution to meet your needs and integrate with other parts of your environment – for example your backup system).

Stuff you need either way

You will need a test plan prior to going live in production.
Ideally you should have it before you even start your build-out (knowing what failures you’re defending against will help you design your system).

Your goal in testing is to demonstrate that the absolute worst confluence of failures will not leave you in a position where you’re losing data (and ideally won’t cause an outage because your storage became inaccessible).
You may not find or test every possible failure scenario, but write down all the ones you can think of and make sure to test them. You don’t want to wait until your first day of live production use to discover that losing one disk in the standby machine can cause the primary to crash — at that point it’s too late to fix.
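
As one illustration of what testing a failure scenario can look like, the sketch below runs on an NFS client against an assumed mount point of /mnt/nfs: it attempts a small synchronous write every second and logs how long writes were unavailable, so you can pull a disk or kill a front-end and measure the impact. (With a “hard” NFS mount the write may hang rather than fail; the gap in the output timestamps still shows the outage window.)

```python
#!/usr/bin/env python3
"""Failover test-harness sketch (illustration only)."""
import os
import time

MOUNT = "/mnt/nfs"   # NFS mount point on the client (assumed path)
TEST_FILE = os.path.join(MOUNT, "failover-probe.txt")

def probe() -> bool:
    """Attempt one small synchronous write; True on success."""
    try:
        with open(TEST_FILE, "w") as f:
            f.write(str(time.time()))
            f.flush()
            os.fsync(f.fileno())
        return True
    except OSError:
        return False

if __name__ == "__main__":
    outage_started = None
    while True:
        ok = probe()
        now = time.strftime("%H:%M:%S")
        if not ok and outage_started is None:
            outage_started = time.time()
            print(f"{now}  writes failing")
        elif ok and outage_started is not None:
            print(f"{now}  writes recovered after "
                  f"{time.time() - outage_started:.1f}s")
            outage_started = None
        time.sleep(1)
```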