
Site Reliability Engineer
- Warsaw, Mazowieckie
- Permanent
- Full-time
- At least 3 years of hands-on experience managing critical, high-availability production infrastructure, demonstrating success in maintaining reliability and maximizing application uptime.
- Proficient in at least one programming language (such as Python, Java, or Rust), with experience designing and building production-quality automation, tools, or software libraries.
- At least 3 years working with monitoring, log aggregation, and observability platforms such as Datadog, CloudWatch, Honeycomb, Splunk, or New Relic, using data-driven insights to proactively identify and resolve issues.
- Excellent analytical skills with the ability to understand end-to-end use cases, map system flows, debug complex issues, and anticipate potential failure points.
- Proven track record of translating SLOs and SLIs into actionable improvements. Reliability, monitoring, and observability are not just words to you.
- At least 3 years of experience with cloud technologies, in particular AWS services and tools such as CloudFormation, Lambda, DynamoDB, SQS, SNS, EC2, S3, the AWS CLI, and Boto3.
- Solid foundation in Linux systems administration, networking, and security.
- Familiarity with using and configuring CI/CD pipelines with tools such as Jenkins and AWS CodePipeline.
- Experience architecting and deploying serverless applications in cloud environments.
- Experience with infrastructure-as-code tools like Terraform or CloudFormation, enabling reproducible and scalable environments.
- Previous participation in production on-call rotations, with direct involvement in incident management and post-incident reviews.
- Demonstrated expertise in performance optimization for core AWS services, including Lambda, DynamoDB, API Gateway, SQS, EventBridge, and EC2.
- Experience supporting and improving systems with frequent, high-velocity deployment cycles.
- Familiarity with security compliance frameworks (e.g., OWASP, ISO, CSA, PCI), and hands-on experience conducting threat assessments and implementing remediation plans.
- Background in security practices, including penetration testing, threat modeling, and usage of both open-source and commercial security tools.
- Experience developing and implementing advanced deployment strategies for web application infrastructures, such as canary releases, A/B testing, and blue/green or red/black deployments.
- Hands-on experience with chaos engineering: intentionally testing systems under extreme conditions to improve reliability and fault tolerance.
- Track record of championing system reliability, continuous improvement, and operational excellence throughout an organization.
- Screening with recruiter (45 min)
- Technical interview with Hiring Manager (60 min)