Headquarters: Cologne, Germany
We are looking for a Site Reliability Engineer (m/f/d). You will be a key member of a tight-knit group of talented engineers who are responsible for keeping our own and our customers’ Kubernetes clusters operational and healthy. You’ll also play a key role in developing the product itself, working together with our Platform Engineers to deliver the greatest Kubernetes service possible.
Giant Swarm is a fast-growing open-source infrastructure management platform used by modern enterprises. Our vision is to empower developers around the world to ship great products. We are a diverse, fully remote (since 2014) and experienced team that is growing and spread across Europe - with a headquarters in Cologne.
- You maintain, operate and upgrade our own and our customers’ Kubernetes clusters.
- You will design, configure, build, and maintain our core infrastructure, from kernel parameters to the cloud provider templates.
- You understand how servers and systems work and you tweak their behavior to your needs.
- You will be responsible for our monitoring, logging and alerting.
- You will help resolve incidents on our own and our customers’ clusters.
- You participate in the on-call support schedule.
- You are a go-to person in case our developers need advice regarding infrastructure.
- You will automate all the things, and the thought of Terraform doesn’t make you cry.
- We (and the majority of our customers) are currently mostly distributed around Europe (around UTC), so your main time zone should be somewhere between UTC-2 and UTC+2 to make communication easier.
- You have deep hands-on knowledge of the inner workings of a Kubernetes cluster.
- You must be able to configure all cluster components from the ground up without automated deployment tools (think Kubernetes the Hard Way).
- You’re comfortable debugging systems at all levels, from kernel fundamentals right up to workloads running on Kubernetes.
- You’re happy troubleshooting a wide variety of issues and you’re not afraid to parse thousands of lines of logs in pursuit of an answer.
- You have good coding skills (preferably Go, but Python or similar is fine as well).
- You have experience maintaining infrastructure as code and you know the pros and cons of various automation tools (we use Terraform & Ansible, but Chef, Puppet and the like are also a good start).
- You are fluent with cloud native tools running on top of Kubernetes (Prometheus, Grafana, ingress controllers, …); you know how to use them and how to configure them.
- You automate all the things by writing code. Using bash scripts makes you sad :)
Important note: We are not hiring job descriptions. We hire humans. :) We welcome applications from everybody, regardless of ethnic or national origin, religion, gender identity, sexual orientation or age.
To apply: https://weworkremotely.com/remote-jobs/giant-swarm-site-reliability-engineer