Must Have:
* 5+ years of experience as a DevOps / SRE / Infrastructure engineer.
* Proven experience managing large-scale SaaS systems on AWS (EKS, RDS, Kafka, Redis, S3, Lambda, CloudWatch).
* Deep understanding of Kubernetes architecture and container orchestration at scale, including Karpenter.
* Hands-on experience with Terraform, Helm, and CI/CD automation (GitHub Actions, Jenkins, or ArgoCD).
* Strong scripting skills in Python, Bash, or Go.
* Familiarity with monitoring and alerting tools (Prometheus, Grafana, Loki, ELK).
* Experience using or integrating AI-assisted tools (e.g., for observability, auto-remediation, or developer productivity).
* Excellent troubleshooting skills and a proactive mindset for reliability and performance optimization.
Nice to Have:
* Experience in multi-environment / multi-tenant SaaS or cybersecurity / threat intelligence systems.
* Knowledge of AI/ML pipelines or AIOps concepts.
* Background in cost optimization and FinOps practices.
* Familiarity with Kafka scaling, Redis clustering, and AWS service-level tuning.
Your day-to-day in this position:
* Be a key player in scaling and modernizing a global cyber intelligence SaaS serving leading enterprises.
* Collaborate with top-tier engineers and architects driving automation and intelligent operations.
* Take ownership and lead initiatives that directly affect uptime, reliability, and efficiency.
* Work in an environment that encourages innovation, experimentation, and adoption of AI and automation in day-to-day operations.
* Lead the DevOps domain: define architecture, automation strategy, and reliability goals for the entire R&D organization.
* Own infrastructure scalability and performance: ensure our Kubernetes (EKS)-based environments are resilient, efficient, and cost-optimized.
* Develop and maintain CI/CD pipelines using GitHub Actions, Jenkins, or ArgoCD to support fast, reliable, and automated delivery.
* Drive observability and reliability initiatives: monitor system health via Prometheus, Grafana, and CloudWatch; define metrics, alerts, and SLOs.
* Leverage AI/automation tooling (e.g., anomaly detection, alert classification, cost prediction) to enhance monitoring, response, and efficiency.
* Manage infrastructure as code (Terraform, Helm, CloudFormation) and enforce IaC best practices.
* Collaborate with engineering teams to design infrastructure for new services, improve developer experience, and ensure secure deployments.
* Ensure system uptime and production readiness: lead root cause analysis, incident response, and capacity planning.
* Continuously evaluate emerging technologies, including AI-driven ops tools, to improve scalability, reliability, and delivery velocity.
Why work with us?
* People-first management with minimal bureaucracy
* A friendly company culture, proven by employees who choose to return
* Flexible working hours
* 29 days of PTO (18 working days per year plus all national holidays)
* 10 paid recovery days
* Full financial and legal support for independent contractors
* Free English classes, with native speakers or Ukrainian teachers
* Dedicated HR support
What is your new project?
* Domain: Computer and network security
* Location: Israel
* Company size: 51-200 employees
* Founded in: 2009
Our next steps:
✅ Intro call with a Recruiter
✅ Home assignment
✅ Client intro interview
✅ Tech interview
✅ HR client interview
✅ Reference check
✅ Offer