Managing an infrastructure with a global footprint can be a challenging task for a centralized operations team. Bringing business awareness to your operations is key to success. In this talk you will learn how the TubeMogul Operations Team easily manages its on-call rotation, how we centralize monitoring information from multiple datacenters for efficient on-call response, and how we define our business priorities with a business-focused dashboard.
Nowadays most start-ups use cloud solutions. While some will move from a public cloud to a hybrid solution, all have to deal with a fast-growing infrastructure. In this presentation, I will go over a few solutions that we implemented at TubeMogul while growing from 20 servers to over 700 servers in 4 years and handling over 10 billion HTTP requests a day. With so much information and data every day, it’s hard to get a good read on what really matters and to alert the right person. You will learn how we integrated Nagios with Google Calendar for easy on-call rotation management, how we centralized our Nagios information from 5 different datacenters into a common dashboard, and how we produce a daily maintenance report for proactive action.
Managing a server infrastructure in a fast-paced environment like a start-up is challenging. You have little time for provisioning, testing and planning, yet you still need to prepare for scaling when your product reaches the tipping point. Amazon EC2 is one of the cloud providers that we experimented with while growing our infrastructure from 20 servers to 500 servers. In this paper we will go over the pros and cons of managing EC2 instances with a mix of BIND, LDAP, SimpleDB and Python scripts; how we kept a smooth working process by using NFS, auto-mount and shell scripting; why we switched from managing our instances with tailor-made AMIs and shell scripts to the official Ubuntu AMI, cloud-init and Puppet; and finally, some rules we had to follow carefully to be able to handle billions of daily non-static HTTP requests across multiple Amazon EC2 regions.