A peek in the engine room - part 1 of 4


Blogpost by: Rob de Boer
From: Friday 27 September 2013

In the next few blog posts, I will try to give you an impression of the choices and methods Onetrail uses to provide its services online.

The most important thing is getting to know Mr. Murphy, the guy who always tells you, ‘Anything that can possibly go wrong, does go wrong’. You just need to be smarter than Mr. Murphy.

How do you cope with this knowledge without spending the whole budget on every possible scenario? You cannot predict every possible failure, but you can still prevent most of them.

Onetrail hosts its servers at a datacenter that provides us with 2 uplink feeds on different physical hardware, with multiple peering agreements. It also provides multiple power feeds per rack. Everything is provided redundantly. But how do we utilize this on our own hardware?

Both uplinks are set up with 2 Cisco ASA firewalls, one running ‘hot’ and the other running in standby. The firewalls are connected to each other and share sessions when required. If one of the firewalls fails, the other automatically takes over all current sessions, and traffic is redirected to the newly announced primary firewall. Since the two are connected to different network and power feeds, failures in both hardware and power are covered.
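
To make the hot/standby idea concrete, here is a minimal sketch in Python. It is a toy model and not how a Cisco ASA actually implements failover: the class names, the firewall names and the session ID are all made up for illustration. It only shows the principle that sessions are copied to the standby unit, so that a takeover does not drop them.

```python
from dataclasses import dataclass, field


@dataclass
class Firewall:
    # Hypothetical stand-in for one firewall unit.
    name: str
    healthy: bool = True
    sessions: set = field(default_factory=set)


class FailoverPair:
    """Toy active/standby pair that shares session state."""

    def __init__(self, primary: Firewall, standby: Firewall):
        self.active = primary
        self.standby = standby

    def open_session(self, session_id: str) -> None:
        # New sessions land on the active unit ...
        self.active.sessions.add(session_id)
        # ... and are copied to the standby while it is reachable.
        if self.standby.healthy:
            self.standby.sessions.add(session_id)

    def check(self) -> Firewall:
        # Promote the standby only if the active unit is down
        # and the standby is able to take over.
        if not self.active.healthy and self.standby.healthy:
            self.active, self.standby = self.standby, self.active
        return self.active


pair = FailoverPair(Firewall("fw-a"), Firewall("fw-b"))
pair.open_session("client-1")
pair.active.healthy = False   # simulate a hardware or power failure
new_active = pair.check()
print(new_active.name, sorted(new_active.sessions))
```

After the simulated failure, the second unit is the active one and it still knows about the session that was opened before the failure.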

The same applies to the switches behind the firewalls: all servers are connected with at least two connections to each switch. If a switch fails, the other switch takes over.

Onetrail uses virtual machines on VMware vSphere, which blurs the line between hardware and software. Most servers are virtualized, running on one of the physical servers in the racks. Each physical server is loaded to at most 75% of its available capacity. If one or two physical servers fail, the other servers can use their spare capacity to host the virtual servers of the failed hardware. Since all network and other settings are stored, everything is available on another server.
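
The 75% rule can be checked with a bit of arithmetic. The sketch below is a simplification under stated assumptions: all hosts are identical, the load is spread evenly, and virtual machines can be moved freely between hosts. The host counts are hypothetical and are not Onetrail's actual numbers; the function names are invented for this example.

```python
def can_absorb(hosts: int, failed: int, load_pct: int) -> bool:
    """True if the surviving hosts can carry the full load.

    Capacity is counted in percent-units: one host = 100 units.
    """
    total_load = hosts * load_pct               # load that must keep running
    remaining_capacity = (hosts - failed) * 100
    return total_load <= remaining_capacity


def max_tolerable_failures(hosts: int, load_pct: int) -> int:
    """Largest number of failed hosts that can still be absorbed.

    Follows from hosts * load_pct <= (hosts - k) * 100.
    Assumes 0 < load_pct <= 100.
    """
    spare_units = hosts * (100 - load_pct)
    return spare_units // 100


# Hypothetical cluster sizes, each host loaded to 75%.
for n in (4, 8):
    print(n, "hosts ->", max_tolerable_failures(n, 75), "failure(s) tolerated")
```

Under these assumptions, surviving two host failures at a 75% load requires at least eight hosts, while a cluster of four tolerates one failure.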

The (virtual) drives are stored on one of the SAN (storage area network) devices, connected to the fiber switch with multipath Fibre Channel cables. Multiple SANs are available, some running a backup of trivial components. If one SAN fails, we are still able to run our platform on the other SANs. If a disk in a SAN fails, a spare disk is brought into the RAID configuration.
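
The spare-disk step can be pictured as a simple swap. This is only an illustrative sketch: the disk names and the function are hypothetical, and a real RAID controller also rebuilds the data onto the spare, which is not modelled here.

```python
def replace_failed_disk(array: list, spares: list, failed: str):
    """Return a new (array, spares) pair with the failed disk swapped out."""
    if failed not in array:
        raise ValueError(f"{failed} is not part of the array")
    if not spares:
        raise RuntimeError("no hot spare available")
    spare, *remaining_spares = spares
    # The spare takes the position of the failed disk.
    new_array = [spare if disk == failed else disk for disk in array]
    return new_array, remaining_spares


array, spares = replace_failed_disk(["d0", "d1", "d2"], ["s0"], "d1")
print(array, spares)
```

Once the spare has been used, the pool of spares is empty, so a failed disk should be physically replaced before the next failure occurs.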

This allows us to handle major hardware issues, power failures and network issues with as little downtime as possible.

Rob de Boer

System Architect

