Infrastructure SRE-Teams@Datadog


Join our Team

Datadog is built by engineers, for engineers.


We do products, frontend, backend, and everything in between!

Datadog Engineering

Datadog is the observability and security platform for cloud-scale infrastructure and applications.
We’re on a mission to build the best platform in the world for engineers to understand and scale
their systems, applications, and teams. We operate at high scale—tens of trillions of data points
per day—providing always-on alerting, metrics visualization, logs, application tracing, network
monitoring, and real-user monitoring for tens of thousands of companies. To deliver a product that
customers love, we tackle and solve complex technical problems at scale, using the cloud and
today’s best open-source technologies.

Infra Tech Stack

Go, Java, Python, Rust, React, Kubernetes, Kafka,
Cassandra, PostgreSQL, Elasticsearch, AWS, GCP, Azure, Terraform,
Artifact Registries, HAProxy, Redis, Consul, Vault, GitLab CI, Bazel

Infra Organization

Runtime Infrastructure
  Release Platforms: Software Delivery App, Release Platforms
  Fabric: Cloud Networks, Network Edge, Fabric Control Plane
  Compute: Production Platform, Compute
Reliability Engineering
  Product Reliability
  Core SRE: Core Resilience, Product Support, Core Observability, Chaos Engineering
Developer Experience
  CI Platforms: Workspaces, CI Runtime, CI Reliability
  Language Tools
High Performance Transaction Systems
  Coordination Platforms
Application Data Platforms
  Database Engines
Runtime Efficiency
  Capacity Management, Cost Observability, Performance

careers.datadoghq.com
Team Overviews

Runtime Infrastructure
Runtime Infrastructure consists of several teams responsible for all of the underlying
infrastructure, Cloud resources and networking for Datadog.

Release Platforms
Release Platforms is the sub-organization within Runtime that focuses on the deployment of
applications. At a high level, they are loosely responsible for the “CD” side of CI/CD and build the
system that Datadog engineers use to release their code.

Software Delivery App: This team builds the primary user interfaces to Datadog’s internal Software
Delivery Platform. The Software Delivery platform is the mechanism by which Datadog engineers
deploy software every day. This includes a web UI for triggering and monitoring deployments as
well as a system for managing feature flags.

Release Platforms: This group consists of 2 teams.

– The Deployments team builds the systems that enable delivery of code into our Runtime
environments. Today, the 2 main types of resources supported by the Deployments team are
Kubernetes applications and Terraform-based infrastructure.
– The Registry Platform team is responsible for artifact storage, management, and replication
across our multiple cloud providers. The registry receives and replicates artifacts being deployed to
any of our environments and aims to abstract differences across cloud provider registries.

Fabric
The teams within this group are dedicated to the development of the application networking fabric.
These teams work together to deliver a feature-rich, observable, and reliable networking layer
that handles such concerns as Service Discovery (Consul), RPC, Network Policy, Traffic Flow
Management, and other network programming topics.

Cloud Networks: This team is responsible for constructing and maintaining a comprehensive model
of the underlying network topologies that Datadog infrastructure is built upon. They leverage it in the
form of tools and services for capacity planning, observability, operations, incident response, and
network infrastructure design.

Network Edge: This team is responsible for cloud load balancers, the internet ingress stack, and other
services related to load balancing and traffic management, such as IP Ranges.

Fabric Control Plane: This team enables efficient and secure service communication via the network
control plane, perimeter gateways, and service discovery. They also maintain a portfolio of
perimeter gateways that enable seamless and secure connectivity across Datadog environments, as
well as with the public Internet.

Compute

Production Platform: This team is responsible for Runtime configuration, Secrets, Identity & Access
Control, and the Runtime SDK, a set of tools (APIs for infrastructure inventory, configuration
delivery, and tools for authentication and authorization) that our engineers use to interact with the
runtime infrastructure.

Compute: The Compute Group is responsible for the underlying infrastructure and Cloud Provider
resources for Datadog. This group consists of 3 teams: Compute Integration, Data Plane, and Control
Plane. These teams run dozens of Kubernetes clusters in production, each comprising over 1000
nodes, across multiple cloud providers and regions. They are responsible for the nodes
underlying the platform, for the integrations between platform users and Kubernetes, and for offering a
scalable Kubernetes API.

Compute Cloud: The Cloud team is responsible for operating the cloud resources used by Datadog
production services. This team has a deep understanding of public clouds and strives to be an
expert in managing cloud environments and the specifics of every cloud provider Datadog operates in.

Reliability Engineering

Product Reliability
This is a group of Reliability Engineers embedded with development teams to help build reliability
as a core quality in each product and service. Challenges can be very different and specific to each
team and context, as each product has different reliability needs depending on its underlying
technologies.

Core SRE

Core Resilience: Resilience is about covering the gaps that will exist in production regardless of
how much we invest in reliability. It's about learning how to respond to unknown failure situations
and “preparing to be unprepared”. Core Resilience is accountable for cross-team subjects but gets
help from the whole Core SRE organization. The team uses a set of tools to ensure Datadog is not
harmed by adverse events, including Game Days, Postmortems, Systems Risk Review, Risk
Remediation, and Incident Management Training and Tooling.

Production Support: This team focuses on on-call experience and health at Datadog. Their mission
is to build the internal on-call platform to organize and support our 24/7 rotations (300+), ease
on-call operations, reduce alert fatigue, and help engineers reduce their on-call load.

Core Observability: This team focuses on the adoption of observability best practices and tooling
for the entire org, with a strong focus on scalability, efficiency, and a turnkey setup. They ensure
our observability (through Datadog) is resilient to regional incidents and address systemic
observability blind spots in our production infrastructure and software.

Chaos Engineering: At Datadog, we know that all systems will eventually fail and we decided to
embrace that failure and make it a first-class citizen by testing it thoroughly in our staging
environment and our production environment. The team owns the Chaos Platform, a set of tools
allowing our engineers to explore service failures in different ways autonomously.

Developer Experience
The mission of this group is to provide a supported, secure SDLC that delights Datadog’s
engineering staff by consistently improving our tooling and removing the undifferentiated lifting that
teams suffer through now.

CI Platforms

Workspaces: The goal of this team is to streamline the developer process by significantly reducing
set-up time, reducing errors, and extending it with additional features to assist with configuration,
deployments and runtime operations. They provide an out-of-the-box dev experience that allows
high-fidelity integration with other Datadog systems and a rapid iteration cycle.

CI Runtime: This team provides the underlying CI runtime infrastructure. They collect feedback from
customers and collaborate with other teams inside and outside of the Developer Experience
organization to identify any gaps in our current system and use that information to create
additional functional requirements.

CI Reliability: This team provides Datadog Engineering with a reliable, scalable, performant and
extensible build and test execution layer. They maintain, evolve and operate CI-related
infrastructure and runtime components by enabling self-service and automated processes to
remove friction when running or configuring CI infrastructure, allowing teams to focus on their
missions.

Language Tools
This team focuses on integration strategies and underlying tooling to boost efficiency for each
language ecosystem. They build SDKs and standard APIs/libraries for building and
testing distributed services at Datadog.

High Performance Transaction Systems


High Performance Transaction Systems enable application teams to iterate quickly and safely as
Datadog grows its product offerings and expands its footprint across dozens of cloud regions.

Coordination Platforms
This is a foundational building block that other application teams within Datadog use to quickly and
easily create applications for our end users. They provide a core set of primitives to support
internal configuration, libraries and resources to manage large clusters in distributed systems
environments.

Application Data Platforms


Application Data Platforms is a part of the Infrastructure org and is responsible for the reliability,
scalability, and ease of use for Datadog's core application data storage platforms. This includes
subject-matter expertise, automation and orchestration development, direct operational support,
and most recently abstraction layer and client development.

Database Engines
The Database Engines group currently comprises 2 teams of subject-matter experts, one for Postgres
and one for Cassandra, that are responsible for managed data storage solutions and tooling.

Runtime Efficiency
The Runtime Efficiency group drives efforts with an infrastructure-first approach to run a large
portfolio of integrated products across multiple cloud vendors in many geographic installations.
This includes driving cost reduction, improving performance, and driving efficiency across the
entire Datadog platform.

Capacity Management
This team manages the capacity for the entire Datadog infrastructure, understanding the capacity
dimensions (such as compute, k8s, network components, storage, etc.) of cloud-based services.
They partner with product teams to automate management of dynamically changing capacity needs
of our high-throughput, low-latency distributed systems.

Cost Observability
This team tracks cost telemetry and presents it meaningfully to product teams to help them
reason about the cost impact of their product choices. At an organizational level, the ability to
understand the cost impact of changes enables decision makers to incorporate cost estimates
during product development.

Performance
This team handles two main tasks: they solve individual performance problems for anyone at
Datadog as a consulting group, and they work on improving efficiency across the entire production
fleet.

Job Openings

AMERICAS
– Software Engineer - Site Reliability
– Software Engineer - CI Reliability
– Software Engineer - Resilience
– Engineering Manager I - Timeseries Query
– Engineering Manager I - Dev Exp. / Lang Tools
– Engineering Manager I - Internal Analytics Rsrch

EMEA
– Software Engineer - Site Reliability Engineering
– Software Engineer - Capacity Management
– Software Engineer - Production Support
– Software Engineer - High Performance Transaction Systems
– Software Engineer - Compute

Datadog Blog and Video Resources

To learn more about Datadog engineering challenges on infrastructure and reliability:

– Blog Posts
– Datadog on Site Reliability Engineering
– Datadog on Chaos Engineering
– Datadog on Kubernetes
– Datadog on gRPC
– Datadog on Software Delivery
– Datadog on Kafka
– Kubernetes Networking at Scale
– Introduction to SRE at Datadog
– Driving Service Reliability Through Autoscaling Workloads on OpenShift
– Building Highly Reliable Data Pipelines at Datadog
– Increasing Reliability with Modern Monitoring and Chaos Engineering

Engineering Culture

Our Philosophy
– Datadog is built by engineers, for engineers
– Strong empathy and identification with end-users
– Dogfooding every single day: we monitor Datadog with Datadog
– Strong sense of ownership, “You write it, you run it, you own it”
– Balance between reliability and product velocity

Organization & Practice


– Small teams to enable big individual impact
– Organization built out of simple blocks (easy to understand and scale)
– Quarterly team-level OKRs
– 24x7x365 service, a good fit for continuous delivery
– Agile, daily stand-ups, sprints of 1, 2, or 4 weeks
– Gameday and blameless post-mortem culture

Career Development & Learning Opportunities


– Management and Technical paths that engineers can choose based on their career goals
and strengths
– A wide range of learning resources available from communication and language proficiency
to engineering skills to help employees grow in their careers
– Engineering Brown Bag talks, Datadog Tech Talk, Engineering Demos, Hackathon, Mentorship
Program, Engineering conferences, etc.

Benefits

At Datadog, we believe our employees should have the support they need to maintain a strong
work/life balance, grow personally and professionally, and save for their future. To learn more
about our benefits in detail across specific locations, please reach out to our recruiting team
during the interview process.

Health and Wellness


We care about the health and well-being of our employees and their loved ones. That’s why we have
competitive benefits that include health, dental, and vision plans for employees, their families, and
their dependents.

Family
In order to support our growing families, we offer best-in-class benefits to help Datadogs navigate
parenthood at any stage. Datadogs get a minimum of 16 weeks of parental leave for birthing parents
and 12 weeks for non-birthing parents.

Finance
We want to help you meet your financial goals, so we offer an Employee Stock Purchase Plan (which
allows Datadogs to purchase shares at a 15% discount), financial planning assistance through local
vendors, an Apple Employee Purchase Program, and commuter benefits programs.

Personal Fulfillment
Datadog’s personal fulfillment benefits support our employees in their pursuit of healthy and
rewarding activities beyond their daily work life. We have a strong learning culture offering
individual- and team-specific training on an ongoing basis delivered by our Talent Development
team and e-learning platforms. Part of that programming includes Manager Training, which
provides useful tools and frameworks around recruiting, managing, and developing teams. We also
offer fitness reimbursements for employees, their spouses, and dependents.

Social
Now, the fun stuff. These benefits ensure that our employees have a remarkable experience both
in and out of the office. Team outings help colleagues socialize and build strong relationships over
meals at local restaurants or immersive activities like cooking lessons. We also offer a competitive
bonus program for employee referrals.

Paid Time Off


Our paid time off program gives employees the freedom and flexibility to take vacation time and
personal days as needed.
