Cloud Operations & Site Reliability Engineer (SRE)

ScyllaDB

Software Engineering, Operations
Asia · Remote
Posted on Dec 24, 2024

Description

ScyllaDB is seeking experienced and dynamic individuals to join our Cloud Operations & Site Reliability Engineering (SRE) team. As a Cloud Operations & Site Reliability Engineer, you will play a vital role in maintaining the operational excellence of Scylla Cloud, our NoSQL database platform. Drawing on your expertise in cloud infrastructure, Kubernetes, and system operations, you will ensure the reliability, scalability, and performance of our cloud offerings. If you are passionate about working in a fast-paced environment, collaborating with cross-functional teams, and driving continuous improvement, this role is tailored for you.

Applicants for this position should be able to start their workday between 21:00 GMT and 01:00 GMT.

Responsibilities:

  • Collaborate with the Cloud Operations & SRE team to ensure the smooth day-to-day operation of Scylla Cloud. Monitor system health, troubleshoot issues, and proactively address any operational challenges.
  • Assist with and perform upgrades for Scylla Cloud, including Scylla database versions, OS upgrades, and security patches. Collaborate with DevOps/Cloud Engineering to ensure seamless upgrade processes.
  • Participate in scaling Scylla Monitor and Scylla Manager servers up and down based on demand. Employ proactive monitoring strategies to identify and address potential performance bottlenecks and resource constraints.
  • Act as a liaison with the Support Organization to address cloud platform-related issues. Respond to tasks and tickets escalated by Support Staff, and collaborate to ensure timely resolutions.
  • Develop and maintain a comprehensive runbook that can be leveraged by Support Staff to troubleshoot and resolve common issues, improving efficiency in issue resolution.
  • Create scripts and automation solutions to streamline operational tasks and enhance efficiency. Contribute to the development of automation strategies for cloud infrastructure management.
  • Collaborate with the Cloud Engineering team to define and create feature requests that enhance the functionality and performance of Scylla Cloud.
  • Conduct regular cluster health and performance audits, identifying areas for optimization. Implement strategies to enhance the efficiency and reliability of Scylla Cloud clusters.
  • Work closely with the Customer Success team to ensure that provisioned resources align with customer needs and purchased packages. Provide insights into potential scaling opportunities and usage optimization.
  • Demonstrate a deep understanding of public cloud environments (AWS, GCP, Azure), Kubernetes, Linux system operations, and NoSQL database deployment/management. Apply this knowledge to resolve complex technical challenges.
  • Use scripting languages such as Python and Bash, together with infrastructure tools such as Terraform and Ansible, to create automation that enhances operational efficiency.
  • Collaborate closely with Support and Engineering teams to address issues, drive improvements, and implement customer-focused solutions.

Requirements:

  • 3+ years of experience in public cloud platforms (AWS, GCP, Azure).
  • 3+ years of Linux system operations and metrics analysis.
  • Availability to start the workday between 21:00 GMT and 01:00 GMT.
  • Strong scripting skills in Python and Bash.
  • Experience with monitoring, reporting, and visualization tools such as Splunk, Grafana, Prometheus, and Kibana.
  • Excellent written and verbal English communication skills.
  • Exceptional organizational skills and ability to manage multiple projects concurrently.
  • Ability to work both independently and collaboratively within cross-functional teams.
  • Strong problem-solving skills, especially under pressure.
  • Eagerness to continuously learn and adapt to emerging technologies.
  • Familiarity with container technologies like Docker and Kubernetes.
  • Familiarity with automation tools such as Ansible and Terraform.

Nice to Have:

  • Proficiency with automation tools such as Ansible and Terraform.
  • 3+ years of Kubernetes experience.
  • Proven expertise in NoSQL database deployment, management, and data modeling.

If you are passionate about contributing to the success of ScyllaDB's cloud offerings and thrive in a dynamic and collaborative environment, we invite you to join our Cloud Operations & SRE team. Your technical expertise, problem-solving skills, and dedication will play a crucial role in ensuring the reliability and performance of Scylla Cloud for our global customer base.