We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.
- Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools.
- Perform deep-dive analysis of RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation
- Help discover correlations between customer experience and performance indicators to determine what customers actually notice, then suggest and implement improvements based on the findings
- Help us develop SLIs, SLOs, and SLAs that meaningfully reflect our customers’ experience
- Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future
- Serve as technical lead for deep dives to identify solutions to prevent future incidents
- Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability
Skills, Knowledge, and Expertise
- Expertise in AWS
- Expertise with RDS, preferably Aurora PostgreSQL engine
- Expertise with containerization
- Experience with open-source monitoring and visualization systems and tools, e.g. Prometheus (metrics monitoring + alerting), Grafana/Kibana (dashboards), Graylog (logging)
- Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling
- Experience implementing improvements in areas such as maintainability, scalability, availability, extensibility, and security
- Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements
- Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals
Our stack is evolving over the next year and we’d love you to be a part of that!
Currently we’re using:
Data: Aurora PostgreSQL, Redis, Elasticsearch
DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog
Testing: Playwright, Mocha, Jest
Front-end: Vue.js, Webpack, SCSS
To apply: https://weworkremotely.com/remote-jobs/knack-senior-site-reliability-engineer