Published about 1 month ago


Headquarters: US
URL: https://knack.com

We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.

 Key Responsibilities 
  • Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools.
  • Perform deep-dive analysis of RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation
  • Help discover correlations between customer experience and performance indicators to determine what customers actually notice, then suggest and implement improvements based on the findings
  • Help us develop SLIs, SLOs, and SLAs that meaningfully reflect our customers’ experience
  • Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future
  • Serve as technical lead for deep dives to identify solutions to prevent future incidents
  • Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability

 Skills, Knowledge, and Expertise 
  • Expertise in AWS
  • Expertise with RDS, preferably Aurora PostgreSQL engine
  • Expertise with containerization
  • Experience with open-source monitoring and visualization systems and tools, e.g. Prometheus (monitoring + tracing), Grafana/Kibana (dashboards), Graylog (logging)
  • Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling
  • Experience with implementing improvements in areas such as maintainability, scalability, availability, extensibility and security
  • Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements
  • Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals

 Our Stack 

Our stack will evolve over the next year, and we’d love for you to be a part of that!
Currently we’re using:
  • Back-end: JavaScript/TypeScript, Node.js, ES6, Go
  • Data: Aurora PostgreSQL, Redis, Elasticsearch
  • DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog
  • Testing: Playwright, Mocha, Jest
  • Front-end: Vue.js, Webpack, SCSS

To apply: https://weworkremotely.com/remote-jobs/knack-senior-site-reliability-engineer