The DevOps team cover all areas of infrastructure, reliability, monitoring, security, CDN configuration, load balancing, and a variety of other technologies that combine to power our cloud-based platform. In addition, they support client services by providing support engineering, and support our engineering teams by maintaining our environments and CI/CD build pipelines and by deploying our releases.
The Tech Stack
In the DevOps team, you will:
- Design, build and implement tools to aid observability, identification and resolution of incidents that occur in live production environments, with a strong emphasis on reducing MTTR as a metric
- Actively troubleshoot escalated production problems
- Contribute to incident retrospectives as someone who is knowledgeable enough to explain what may have occurred at the platform level
- Work with Engineering and Product teams to promote and expand SRE concepts in both a consultancy and a hands-on fashion, identifying how they best fit their services and benefit them
- Reduce MTTR by working with other teams to understand their situation and surface and present the right data
- Apply anomaly detection and failure prediction in live environments
- Improve platform visibility by identifying metrics to base decisions on, sourcing them if we do not record them, and equally explaining which metrics are not that valuable
- See data presentation as a socio-technological problem: anyone can create a dashboard, but what we need is the most pertinent metrics presented in the most quickly understood, human-consumable way to reduce the MTTR of an incident
- Contribute to capacity projects to recognise issues and changes in traffic before they have an impact
- Contribute to scaling projects, understanding the benefits and risks of scaling architecture on demand and the challenges of achieving it for the differing profiles of our services
- Participate in an on-call schedule to ensure that our systems are supported at all times
- Support cloud infrastructure services in AWS and Azure
- Contribute to cyber security, ensuring we have a strong defence against attacks and are following best practices
What you'll bring (Essential)
- SRE concepts such as SLIs, SLOs and error budgets
- Observability concepts such as the RED and USE methods
- Strong understanding of HTTP (status codes in detail, nuances of HTTP headers, cookies, connection and request life cycle)
- Strong understanding of TCP (lifecycle, connection and termination scenarios)
- Strong understanding of load balancing (HTTP and TCP) and reverse proxy concepts
- Application/service architecture concepts (threads, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff)
- AWS – EC2, S3 and config management, VPC, Networking
- Azure – Service Plans, Blob Storage and config management, Front Door, Networking
- AppInsights – strong knowledge of querying and notifications, graph/dashboard tips and tricks for best human consumption
- Elasticsearch - architecture of nodes, logical flow of a write, logical flow of a read, architecture of indices and shards, some API knowledge, key cluster health/usage metrics, writing queries, aggregations, watches, mappings, schema
- Logstash - tuning, pipelines, writing parsing config that includes some enrichment, health/usage metrics
- Kibana – strong knowledge of object management, graph/dashboard tips and tricks for best human consumption and lowest cost Elasticsearch queries, Timelion, ML
- Grafana - graph/dashboard tips and tricks for best human consumption and lowest cost datasource queries
- General understanding of the challenges of long-term data retention and its impact on query latency, storage, compute, and data architecture
- Knowledge and experience of database development, both relational and NoSQL
- Desirable to have knowledge and experience of developing for services on Azure
- Desirable to have some knowledge and experience of search engines such as Azure Search and Elasticsearch
- Familiarity with container orchestration services, especially Kubernetes
- Experience administering and deploying development CI/CD tools
- Significant experience with Windows and Linux operating system environments
- Experience using Terraform for infrastructure as code
- Knowledge and experience of deploying to cloud providers, such as AWS and Azure