Career Profile

Nicolas, a Cloud Technology Leader, became Director of Operations Engineering at Adobe (NASDAQ: ADBE) after the acquisition of TubeMogul (NASDAQ: TUBE). As TubeMogul's sixth employee and first operations hire, Nicolas has built and grown Adobe/TubeMogul's infrastructure over the past ten years from several machines to over eight thousand servers that handle roughly 350 billion requests per day for clients like Allstate, Chrysler, Heineken and Hotels.com.

Adept at adapting quickly to ongoing business needs and constraints, Nicolas leads a global team of site reliability engineers, cloud engineers, software engineers, security engineers, and database architects that build, manage, and monitor Adobe Advertising Cloud's infrastructure 24/7 and adhere to "DevOps" methodology. Nicolas is a frequent speaker at top U.S. technology conferences and regularly gives advice to other operations engineers. Prior to relocating to the U.S. to join TubeMogul, Nicolas worked in technology for two decades, managing heavy traffic and large user databases for companies like MultiMania, Lycos and Kewego. Nicolas lives in Danville, CA and is an avid fisherman and aspiring cowboy.

Highlights:

  • Built from the ground up and lead a global team of 60 operations engineers (FTEs, vendor workers, contingent workers)
  • Global team with staff in 4 different time zones (Ukraine, China, India, US) to ensure 24/7 support (Follow The Sun)
  • Support a global product and engineering team of ±250 people
  • Built and support an infrastructure of ±8,000 assets across 6 data center locations in the US, Europe, and APAC
  • Built a multi-cloud solution with cloud bursting capabilities to support product scale and latency requirements
  • Designed and deployed a solution to deliver services in Mainland China with a POP in Beijing and direct connectivity to the HKG data center
  • Responsible for infrastructure P&L, with a goal set on TI cost as a percentage of Gross Profit
  • Define strategy and tactical plan to ensure SOC2/ISO/SOX compliance

Technologies: Linux, Puppet, Python, Ruby, PHP, Java, Go, Jenkins, Graphite, Ganglia, Grafana, Nagios, Sensu, AWS, HAProxy, OpenStack, ZooKeeper, Kafka, Couchbase, MySQL, Elasticsearch/ELK, Splunk, HBase, Hadoop, Ubuntu, Debian, Docker, Containers, Kubernetes, KVM, TCP/IP, Open vSwitch, etc.

Public Talks & Papers

Top 5 Machine Learning and Self-Healing Techniques used by SRE

December 11th, 2018 on LinkedIn

Over the past few years I had the unique opportunity to see a start-up, TubeMogul, go through hyper-growth, an IPO, and an acquisition by a Fortune 500 company, Adobe. On this journey, I was exposed to many technical challenges and worked on systems at an astonishing scale, i.e., over 350 billion real-time bidding requests a day. It allowed me to form strong personal opinions on the role of an SRE and how SREs can help transform an organization. This post covers self-healing design, forecasting algorithms, anomaly detection, and risk classification, and provides real use cases from Adobe SRE teams.

Full Paper

Improving Adobe Experience Cloud Services Dependability with Machine Learning

November 29th, 2018 at Machine Learning for DevOps Summit, Houston, TX

Adobe Experience Cloud is a collection of best-in-class solutions for marketing, analytics, advertising, and commerce, all integrated on a cloud platform for a single experience system of record. The Adobe Experience Cloud's SRE team works hand-in-hand with the Product and Engineering teams to build dependable services. In this presentation you will learn how the team leverages Adobe's artificial intelligence and machine learning engine to build predictive auto-scaling and self-healing services.

Slides

Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service. (Won Best Disruptive Idea Award)

October 17th, 2018 at the 29th IEEE International Symposium on Software Reliability Engineering (ISSRE 2018), Memphis, TN

The advertising industry faces numerous challenges in achieving its goal of targeting a given audience dynamically and accurately in order to deliver a meaningful brand message. Near real-time, low latency delivery of dynamic content, the sheer volume of information processed, and the sparse geographic distribution of the intended eyeball traffic all drive the complexity of building a successful experience for the end user and the brand. Additionally, the competitiveness of the industry makes it critical to preserve low operational expenses while delivering reliably at scale. In attempting to address the above, we have found that a distributed infrastructure that leverages public cloud providers and a private cloud with open infrastructure technologies can deliver dynamic advertising content with low latency while preserving its high availability. But network or physical utility infrastructures can’t be relied on to ensure the service dependability. We show that the complexity of the networks, the sparse geographic distribution of eyeballs, the risk of data center failures, and the increase of encrypted transactions call for thoughtful architectures. The introduction of modern practices, failure injections, and self-healing mechanisms allowed us to improve the service fault tolerance while optimizing for latency and significantly improve our service reliability.

Abstract | Award | Full Paper | Slides | Teaser | DBLP

See All Archived Public Talks And Papers