Job type

Permanent

Location

London

DevOps @ATG

The DevOps team cover all areas of infrastructure, reliability, monitoring, security, CDN configuration, load balancing, and a variety of other technologies that combine to power our cloud-based platform. In addition, they support client services by providing support engineering, and support our engineering teams by maintaining our environments, CI/CD build pipelines, and deploying our releases.

The Tech Stack

Azure cloud, PowerShell, Visual Studio Team Services (VSTS), .NET Framework, .NET Core, C#, HTML5/CSS3, JavaScript, React, SignalR, ASP.NET MVC, SQL Server, MySQL, Redis, AWS, RDS, Cosmos DB, Docker, Git, Octopus

In the DevOps team, you will:

- Design, build and implement tools to aid observability, identification and resolution of incidents that occur in live production environments, with a strong emphasis on reducing MTTR as a metric

- Actively troubleshoot escalated production problems

- Contribute to incident retrospectives as someone knowledgeable enough to explain what may have occurred at the platform level

- Work with Engineering and Product teams to promote and expand SRE concepts, in both a consultancy and a hands-on capacity, identifying how they best fit their services and benefit them

- Reduce MTTR by working with other teams to understand their situation and surface and present the right data

- Apply anomaly detection and failure prediction in live environments

- Improve platform visibility by identifying the metrics to base decisions on, sourcing them if we do not record them, and equally explaining which metrics are less valuable

- See data presentation as a socio-technological problem: anyone can create a dashboard, but what we need is the most pertinent metrics presented in the most quickly understood, human-consumable way to reduce the MTTR of an incident

- Contribute to capacity projects to recognise issues and changes in traffic before they have an impact

- Contribute to scaling projects, understanding the benefits and risks of scaling architecture on demand and the challenges of achieving it for the differing profiles of our services

- Participate in an on-call schedule to ensure that our systems are supported at all times

- Support cloud infrastructure services in AWS and Azure

- Contribute to cyber security, ensuring we have a strong defence against attacks and follow best practices

What you'll bring (Essential)

  • SRE concepts such as SLIs, SLOs and error budgets
  • Observability concepts such as the RED and USE methods
  • Strong understanding of HTTP (status codes in detail, nuances of HTTP headers, cookies, connection and request life cycle)
  • Strong understanding of TCP (lifecycle, connection and termination scenarios)
  • Strong understanding of load balancing (HTTP and TCP) and reverse proxy concepts
  • Application/service architecture concepts (threads, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff)
  • AWS – EC2, S3 and config management, VPC, Networking
  • Azure – Service Plans, Blob Storage and config management, Front Door, Networking
  • AppInsights – strong knowledge of querying and notifications, graph/dashboard tips and tricks for best human consumption
  • Elasticsearch – architecture of nodes, logical flow of a write, logical flow of a read, architecture of indices and shards, some API knowledge, key cluster health/usage metrics, writing queries, aggregations, watches, mappings, schema
  • Logstash – tuning, pipelines, writing parsing config that includes some enrichment, health/usage metrics
  • Kibana – strong knowledge of object management, graph/dashboard tips and tricks for best human consumption and lowest-cost Elasticsearch queries, Timelion, ML
  • Grafana – graph/dashboard tips and tricks for best human consumption and lowest-cost datasource queries
  • General understanding of the challenges of long-term data retention and its impact on query latency, storage, compute, and data architecture
  • Knowledge and development experience with .NET Core, JavaScript, TDD, REST APIs
  • Knowledge and experience of database development, both relational and NoSQL
  • Desirable to have knowledge and experience of developing for services on Azure
  • Desirable to have some knowledge and experience of search engines such as Azure Search and Elasticsearch
  • Familiarity with container orchestration services, especially Kubernetes
  • Experience administering and deploying development CI/CD tools
  • Significant experience with Windows and Linux operating system environments
  • Experience using Terraform for infrastructure as code
  • Knowledge and experience of deploying to cloud providers, such as AWS and Azure
