Infrastructure as code: Successfully scaling to new heights
How do you prepare a product to meet the demands of a rapidly growing user base? Meet someone who knows.
Alan Garfield is a busy man. Not only does he need to figure out how to keep Learnosity’s performance both speedy and secure, but he also has to do so while the system delivers 500K tests per minute and responds to 5.5 billion API requests per month.
And things are only gathering steam. With each back-to-school season, the company’s Principal Engineer of Infrastructure needs to prepare for more users and even greater test load. To find out how he keeps the system ahead of the curve, we asked him about the many spinning plates he maintains, along with the benefits of adopting an “infrastructure as code” approach on the job.
Insights on infrastructure
Given the scale at which Learnosity is used, there’s an amazingly high uptime rate. How do we plan for and measure such success?
I think our uptime is directly related to the fact we’ve optimized our system so that there are minimal single points of failure (SPOF) and our infrastructure is as “immutable” as possible right now. Having quality scaling solutions with fault tolerance built in allows the systems to heal themselves and recover without much human intervention. Another reason for our high uptime is our ability to shift load around any issues that may occur, and to fail gracefully when we can’t!
What other goals do we have for infrastructure? What do we want to enable the system to scale to?
One goal is maximum throughput and reliability for the lowest cost while maintaining the highest resource efficiency possible.
Our applications have been tested at a scale far beyond our peak load, and in that testing, the biggest bottleneck is wherever the data has to be stored at rest (e.g., databases and disks). We use in-memory caching to alleviate this issue. However, any significant growth eventually becomes a data storage problem, no matter what you do. We have data sharding in our database infrastructure to help deal with this by splitting the load, but there are still upper limits. When the cost of administering the data storage starts to outweigh the cost of reworking the technology, you start looking at other techniques like early request load splitting and compartmentalizing clients into isolated silos.
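To make the sharding idea concrete, here is a minimal sketch in Python. The interview does not say how Learnosity assigns data to shards, so the hash-based scheme, the `shard_for` and `group_by_shard` helpers, and the client identifiers are all assumptions for illustration only.

```python
import hashlib


def shard_for(client_id: str, shard_count: int) -> int:
    """Map a client id to a shard index using a stable hash.

    Hypothetical scheme: the source only says the databases are sharded,
    not how keys are assigned to shards.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(client_ids, shard_count):
    """Split a list of client ids into per-shard buckets."""
    shards = {index: [] for index in range(shard_count)}
    for client_id in client_ids:
        shards[shard_for(client_id, shard_count)].append(client_id)
    return shards


if __name__ == "__main__":
    # Made-up client names; every client always lands on the same shard.
    clients = [f"client-{n}" for n in range(12)]
    for index, members in group_by_shard(clients, 4).items():
        print(index, members)
```

Each shard only carries a fraction of the total load, but the number of shards still has to be managed, which is one way to read the "upper limits" mentioned above.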
What’s the best piece of advice you have for designing for redundancy?
Removing all single points of failure is key. This can include staff too.
Scaling isn’t actually that hard, but doing so within a budget in terms of time and money is the real trick. Rock-star staff with knowledge locked in their heads don’t scale well either. By using infrastructure-as-code techniques you can tease this knowledge out into the provisioning systems and so rely less on the rock stars to handle the day-to-day running of the system. The rock stars are probably better used moving the infrastructure forward in terms of future directions anyway. The day-to-day should be boring and uneventful. Rock stars should hate boring!
Can you tell us more about the “infrastructure as code approach” at Learnosity? How does it help improve our service?
Infrastructure as code is the idea that nothing is created manually. It’s like setting a template for the systems: the template can be reused to achieve the same results without relying on explicit documentation or on the skills of particular staff. It’s basically the blueprint for how to build a “Learnosity”.
I’ll try to explain this by using a coin-making analogy. A skilled craftsperson could create a coin from raw materials using hand tools. His or her work would be unique and hard to replicate. If the coin got damaged or lost there would be a considerable effort in either repairing or replacing it. As with pretty much any company starting out, Learnosity will have systems that are quite similar to this. A core team of individuals will create and configure custom systems to get the product out the door and each system will be handcrafted to fit whatever purpose the business needs at that particular time.
Now, say you used a coin press instead of working by hand, and you fixed your design into the pressing plates. You can now theoretically create as many coins as you’d like given the raw materials required while expending considerably less effort. Each coin is basically identical. Their value is fixed and equal. They are completely interchangeable. If you lose one down a drain or it’s damaged, you can simply replace it with another and move on.
The idea is that the infrastructure’s value isn’t in its creation. Instead, the value derives from the tasks it performs and how easily those tasks can be shared. Learnosity grew from the original custom-built systems to “immutable” infrastructure. Starting out, nobody could invest the time and effort into this type of automation. You’re basically racing the clock to prove your business to the world. However, once your business is established, you are then racing another clock to innovate and adapt to change. Neither of these is possible if your teams are still handcrafting coins to replace old ones each day.
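The coin-press analogy can be sketched in a few lines of Python. This is only an illustration of the template idea, not Learnosity's provisioning code: `InstanceTemplate`, `stamp`, `replace_instance`, and the image name `centos-base` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceTemplate:
    """The 'pressing plate': a fixed, reusable description of a server."""
    role: str
    image: str
    packages: tuple


def stamp(template, count):
    """Produce `count` identical instance definitions from one template."""
    return [
        {
            "name": f"{template.role}-{index}",
            "image": template.image,
            "packages": list(template.packages),
        }
        for index in range(count)
    ]


def replace_instance(fleet, index, template):
    """Swap a damaged instance for a fresh one instead of repairing it."""
    fresh = stamp(template, 1)[0]
    fresh["name"] = fleet[index]["name"]  # keep the slot's name
    return fleet[:index] + [fresh] + fleet[index + 1:]


if __name__ == "__main__":
    # Illustrative values only.
    web = InstanceTemplate(role="web", image="centos-base",
                           packages=("nginx", "php-fpm"))
    fleet = stamp(web, 3)
    fleet[1]["packages"].append("hand-edited-fix")  # a "damaged coin"
    fleet = replace_instance(fleet, 1, web)
    print(fleet[1])
```

Because every instance comes from the same template, a drifted or broken one is discarded and re-stamped, which is the sense in which the infrastructure is "immutable".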
Striking a balance
What are some of the major obstacles of scaling a product to such a degree?
When it comes to infrastructure, the main obstacles aren’t necessarily technical in nature. With cloud computing, anyone these days can go “fast” if they’re prepared to pony up the dollars. Even poorly written applications can somewhat scale if you’re prepared to pay for the horsepower to push the dog up the hill.
The real skill is balancing three things: choosing the right technology for the applications, applying the appropriate design criteria to meet the requirements of the clients, and being as efficient as possible with the resources available at your budget level.
The only place where throwing money at the problem doesn’t work is reliability. Eventually, a poorly architected and inefficient application will fail.
What are some of the more general challenges the tech teams face and how do they approach them?
Keeping the systems secure and safe is probably one of our largest challenges. As we grow and integrate more features, keeping our customer data safe and secure is a top priority. With security baked in from inception to deployment through solid network architecture; proven technology built into the operating systems; testing and adopting the latest techniques into our applications; constant communication between our teams; and the security resources available on the internet, we work very hard to maintain our defenses.
We regularly organize third-party companies to come and perform penetration tests and security assessments of our systems and code. We also run security workshops internally to help educate staff and share knowledge on the challenges we face and how to avoid them during development.
Probably the next challenge would be the scale of our client data. The amount of data we process daily can be a heavy weight to lift. With many sources of data all feeding into our systems it can be a challenge to find new ways to get performance and quality results from an ever-growing pool of data. It’s something we also work very hard at, and we hope our customers can see the results in their products.
Here comes the technical bit
We recently migrated to CentOS. What was the reasoning behind this?
We moved from Ubuntu to CentOS to solve some annoying package management issues. The two operating systems are basically the same at their core, but the package management of each is completely different. Ubuntu uses “apt” packages and CentOS uses “RPM”. They are functionally similar but the coordination and governance of them outside of the operating systems are quite different.
For example, keeping an apt mirror to control the versions of our dependent packages required investment in storage well above that of RPM. Maintaining “aptly” was a cost in administration we wanted to remove.
Another advantage in the migration was that we gained a bunch of security features like SELinux, which increases the inherent security of our systems by defining strict rules as to what resources an application can access.
What about the move to Nginx and PHP-FPM?
We moved from Apache to Nginx/PHP-FPM for two reasons. First, Apache uses a thing called mod_php. Each client request to Apache would create a “process” for that client. In that same process, Apache would always create a PHP handler too. This means you have a one-to-one mapping of requests to PHP interpreters.
Now, this sounds OK, but as you scale up this becomes very inefficient. For example, static file requests such as images will have a PHP interpreter chew up resources (e.g. memory) for zero use in that request. If you’re serving a lot of static file requests this can be extremely inefficient (and really you should have a good CDN in front of your servers anyway). The PHP interpreter is basically always created whether it’s needed or not.
Second, with this one-to-one mapping of requests to server resources, you now have an upper limit on the number of requests you can handle at once (e.g. bursts). The amount of memory used to create all these PHP resources has reduced the amount available to you and now you’re no longer using memory efficiently.
Using Nginx and PHP-FPM as an alternative breaks the bond between the “request” and the “processing” of the PHP. Nginx accepts the requests. If it’s a static file request it can handle that itself without needing the resources required for PHP. If the request is for PHP, Nginx can pass this request onto PHP-FPM for processing. PHP-FPM works as a pool of “handlers”. They don’t need to be created each time; they are reused and kept in a shared pool. This means for a given number of incoming requests there are fewer PHP “handlers” required to handle the given PHP application load.
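The split between accepting requests and processing PHP can be sketched with a small Python model. This is an analogy rather than how Nginx or PHP-FPM are implemented: threads stand in for PHP-FPM worker processes, and `handle_php`, `serve`, and the request paths are hypothetical names for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_php(path):
    """Stand-in for a pooled PHP handler; reports which worker ran it."""
    return threading.current_thread().name


def serve(paths, pool_size):
    """Serve static paths directly and pass .php paths to a fixed pool.

    Returns the number of static requests served by the front end and
    the set of distinct pool workers that handled the PHP requests.
    """
    static_served = 0
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = []
        for path in paths:
            if path.endswith(".php"):
                futures.append(pool.submit(handle_php, path))
            else:
                static_served += 1  # no PHP handler involved at all
        workers = {future.result() for future in futures}
    return static_served, workers


if __name__ == "__main__":
    requests = ["/index.php"] * 50 + ["/logo.png"] * 50
    static_served, workers = serve(requests, pool_size=4)
    print(static_served, "static requests,", len(workers), "PHP workers used")
```

However many PHP requests arrive, the number of handlers never exceeds the pool size, and static requests never touch a handler.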
There are also other benefits to this decoupling. Say, for example, you have 500 clients all hitting Apache with a similar request that requires database access. Because of the one-to-one nature, you now have 500 database connections to your databases. With Nginx/PHP-FPM these same 500 clients all connect via Nginx; however, because of this decoupling of PHP-FPM, these same requests can be handled with about one-tenth of the memory resources and about one-fiftieth of the number of database connections.
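The 500-client example works out as follows in a rough Python model. It assumes each PHP handler holds exactly one database connection and that the pool is sized at one-fiftieth of the client count to match the figure quoted above; both are simplifications, the function names are hypothetical, and memory use is not modeled.

```python
def apache_mod_php(clients):
    """One process, one PHP interpreter, one DB connection per client."""
    return {"php_handlers": clients, "db_connections": clients}


def nginx_php_fpm(clients, connection_ratio=50):
    """A shared pool sized at roughly 1/connection_ratio of the clients.

    Assumption: each pooled handler keeps a single database connection.
    """
    handlers = max(1, clients // connection_ratio)
    return {"php_handlers": handlers, "db_connections": handlers}


if __name__ == "__main__":
    print(apache_mod_php(500))  # 500 handlers, 500 connections
    print(nginx_php_fpm(500))   # 10 handlers, 10 connections
```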
Changing for the better
What are your two favorite tools and why?
One would be SaltStack. It not only lets our instances provision themselves automatically but also allows us to orchestrate changes from a single point to thousands of instances at once.
Another favorite is Vagrant, which allows us to test and develop infrastructure code locally using VMs so we can test provision scripts and so on without using AWS (Amazon Web Services) resources.
How do you keep up to date with the latest practices?
Being a dutiful netizen and staying informed and educated on the latest techniques, tools, and strategies in tech.
One thing we’re extremely good at in Learnosity is quickly adopting new technology and shifting infrastructure if we consider something to be better. Nothing is set in stone and all ideas will be considered, even if making a switch requires a lot of effort.
We also make a point of attending several conferences every year, including AWS re:Invent, which really spurs us on when we see new AWS products and sparks conversations and ideas for the future.
Why do you like working at Learnosity – and in the Infrastructure team specifically?
When I joined Learnosity I was charged with bringing our systems to a level of scale that Mark and Gavin [Learnosity’s co-founders] knew was needed but hadn’t yet achieved. My contemporaries had gotten Learnosity into AWS and had given Learnosity its first level of scale. However, the infrastructure was brittle, weak, and prone to fault.
Mark trusted me five years ago with refitting and replacing our infrastructure with something I felt would accomplish our goals. The work we’ve completed in that time, and the challenges we’ve overcome, are a big part of the reason I work for Learnosity. Taking something raw and unrefined and directly reforming it into a system that can handle the load we currently see is something that gets me out of bed in the morning – and sometimes really early morning!
I work in infrastructure because we are the pointy end of the stick; we are where the rubber meets the road, so to speak. Our clients base their businesses on our technology, so our systems play a big part not only in Learnosity’s future but also in that of our clients. Our performance is their performance, so it’s our job to do everything we can to make the “hard stuff” easy for them.