System Development Manager, AWS Resilience, AWS Incident Response Job at Amazon Development Center U.S., Inc., Seattle, WA

  • Amazon Development Center U.S., Inc.
  • Seattle, WA

Job Description

DESCRIPTION

AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure. In other words, we’re the people who keep the cloud running. We support all AWS data centers and all of the servers, storage, networking, power, and cooling equipment that ensure our customers have continual access to the innovation they rely on. We work on the most challenging problems, with thousands of variables impacting the supply chain — and we’re looking for talented people who want to help.

You’ll join a diverse team of software, hardware, and network engineers, supply chain specialists, security experts, operations managers, and other vital roles. You’ll collaborate with people across AWS to help us deliver the highest standards for safety and security while providing seemingly infinite capacity at the lowest possible cost for our customers. And you’ll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.

AWS Resilience owns the services that prevent and respond to availability and security issues for all AWS services. In other words, we’re the people who keep the cloud running. We work on the most challenging problems, with new services and possible failure modes constantly emerging, and we’re looking for talented people who want to help.

You’ll join a diverse team of software engineers, security experts, operations managers, and other vital roles. You’ll collaborate with people across AWS to help us deliver the highest standards for safety, security, and availability. You’ll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.

AWS Incident Response is at the heart of the high availability of Amazon Web Services. We make customer-impacting events shorter and less frequent by driving large-scale event and incident response. Our automated tooling quickly identifies the cause of an issue and helps mitigate its impact, and much of our engineering time is spent on projects to improve the tooling and automation. We also provide manual incident management for AWS and other Amazon groups, directing the resolution of an issue with service teams and diving deep into those events to drive improvements to the tooling. It's an exciting time to join our team as we are growing and expanding our offerings.

As a System Development Manager on the team, you will manage automated tooling roadmaps and delivery for the detection and resolution of issues within AWS and Amazon infrastructure. You will also spend a portion of your time ensuring your team efficiently directs the resolution of high-visibility incidents on conference calls and across global teams. Using data learned from those incidents, you will drive further improvements into our automation, tooling, and processes so that the next event is shorter or avoided entirely. You will coordinate across project teams to expand use of our tooling to additional areas across Amazon. If you're looking for a team with great growth potential and an opportunity to make a huge impact, this is the team to join.

Key job responsibilities
Define and Deliver Business Priorities
You will be a key contributor and owner of the direction of the global AWS Incident Response team. You will define, plan, track and deliver on strategic goals for the team, while ensuring that the team remains unblocked and focused.

Cross-Site, Cross-Team Coordination
You will be responsible for coordinating with your counterparts to ensure that a clear communication channel exists between AWS Operations teams. You will also work closely with systems and product teams to create and maintain proper processes for monitoring and alarming on services. A portion of this work will include establishing both solid operational acceptance criteria and a concrete feedback loop for resolving deviations from those processes.

Incident/Change Management
You will be the point of contact for inquiries regarding engagement processes and issues within the global Amazon platform during your team’s coverage. Responsibilities include delegation of emergent engagement issues to team members, driving initiatives regarding improvements to existing tools & processes and providing feedback on new practices & procedures in order to scale with the rapid expansion of the AWS Services and customer base.

Performance Management/Team Health
You will own all facets of performance and career management for the team.

About the team
As a Senior Engineering Manager in AWS Infrastructure Reliability (AIR), you'll lead teams that enhance Amazon's incident response capabilities across our global cloud infrastructure. You'll drive improvements in event detection, management, and resolution processes while reducing false alarms and resolver minutes during critical incidents.

This role demands both technical leadership and strategic vision as you collaborate with service teams across AWS to implement reliability improvements. You'll guide engineers through complex challenges, optimize call leadership processes, and develop systems that ensure rapid recovery from service disruptions, directly enhancing customer experience through improved availability.

BASIC QUALIFICATIONS

- 5+ years of direct experience with cloud hosting technologies (AWS, Azure, etc.) / 5+ years of experience managing an engineering team operating at scale.
- Deep understanding of infrastructure delivered through the software development lifecycle in an API-enabled environment – including agile development, software /patterns, and modern cloud services.
- Experience in implementing, supporting, and evaluating tools and services with a security, scalability, and performance mindset
- Ability to handle multiple competing priorities in a fast-paced environment
- Excellent written and verbal communication skills and ability to get ideas across to the team, peers and customers.

Job Tags

Full time
