Infrastructure Engineer at Gremlin
- Location: Remote within US
Employees at Gremlin have the unique chance to be at the forefront of a new IT practice, Chaos Engineering, just as the industry is beginning to adopt it. Our team is growing, and you have an opportunity to join us as we transform how companies like Target, Twilio, National Australia Bank, and many more build reliable software.
Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.
Chaos Engineering helps engineers test large-scale systems to uncover weaknesses before they cause outages. It is a foundational practice for every company that provides mission-critical, always-online services. Gremlin provides a platform that makes the practice much faster to implement and easier to use for enterprise customers large and small.
About the Role of the Infrastructure Engineer
The Infrastructure Engineer is responsible for designing, implementing, and maintaining our AWS environment and supporting systems; assessing risks and vulnerabilities in Gremlin infrastructure and related processes; developing platform infrastructure; educating users on related policies and best practices and evangelizing those policies and processes; and helping implement controls that ensure compliance with industry standards. The ideal candidate can function in an engineering-driven, multidisciplinary environment, is self-motivated, and can work independently as part of a distributed team.
- Design, develop, and maintain AWS systems and services
- Automate and maintain CI/CD systems
- Perform infrastructure design and implementation
- Participate in incident response and on-call rotations
- Perform internal auditing of Gremlin infrastructure and external services
- Work directly with development teams to establish and implement designs for Gremlin infrastructure and related development processes
- Conduct incident prevention, detection, containment, and eradication across Gremlin systems by enhancing processes, monitoring events, responding to incidents, and summarizing and reporting findings
Key Skills and Attributes
If you don’t think you meet all of the criteria below but are still interested in the job, please apply. Nobody checks every box; we’re looking for candidates who are particularly strong in a few areas and have some interest and capability in others.
- Autonomous self-starter who is motivated to go above and beyond the call of duty
- Experience with controls for securing containers and container orchestration platforms (Kubernetes)
- Familiarity with identity and access management concepts (IAM & RBAC)
- Familiarity with AWS best practices and frameworks
- Familiarity with AWS CloudWatch, CloudTrail, GuardDuty, CloudFormation, and other configuration and monitoring tools
- Experience with Google Cloud Platform (GCP) and/or Azure is a plus
- Ability to effectively promote priorities throughout the organization and coordinate closely with engineering resources
- Experience working with security engineers to design and implement information security and privacy controls in AWS and dependent services such as Datadog
- Experience maintaining testing frameworks for infrastructure systems
- Ability to research, recommend, and implement controls to improve usability of infrastructure systems
Our founders, Kolton Andrus and Matthew Fornaciari, lived and breathed incidents, on-call, and Chaos Engineering at Amazon and Netflix. As Call Leaders, they were responsible for guiding teams through analyzing and resolving global outages. After a decade of developing and advocating Chaos Engineering internally, in 2016 they decided to make what they had learned available to a wider set of enterprise companies and launched Gremlin.
Since then, Gremlin has built an incredible team of industry veterans and people eager to learn from one another while pushing the entire industry forward to new heights. We’re backed by top-tier investors Index Ventures, Amplify Partners, and Redpoint Ventures. Our customers love us, and we’re thrilled to be a partner in their success.
At Gremlin, we value:
- OUR CUSTOMERS – We won’t be a company if our customers aren’t thrilled. We live and die by our customers, so they come first.
- ACTION – We favor small experiments to gather data rather than over-analyzing a situation. Getting stuff done always beats talking about getting stuff done.
- CONTEXT, NOT CONTROL – We hire autonomous adults with good judgement. We provide them with the context to make smart decisions. We don’t micromanage.
- BEING VOCALLY SELF-CRITICAL – We all make mistakes, we all have ways in which we can improve. We own that upfront, and honestly discuss ways in which we’ve personally made mistakes and can get better. Then, we encourage and help one another succeed at doing so.
- DIVERSITY, EQUITY, & INCLUSION – We are at our best when we encourage and include the thoughts and voices of people from many diverse backgrounds into our strategy and execution. We recognize that systemic racism and gender bias are real and that we aren’t perfect, so we actively work to encourage the difficult conversations, to listen, and to change as we discover our blind spots so that Gremlin is a company all of us feel proud to be a part of.
- FRUGALITY – We are working to build a profitable company and create a new practice in the industry. We spend money on the right things, like making sure employees have the tools they need to be successful and the company has what it needs; we simply choose not to waste what we have and not to buy what we don’t actually need.
You are welcome at Gremlin for who you are. The more voices and ideas we have represented in our business, the more we will all flourish, contribute, and build a more reliable internet. Gremlin is a place where everyone is encouraged to grow. However you identify and whatever background you bring with you, please apply if this sounds like a role that would make you excited to come to work every day. It’s in our differences that we will find the power to keep building a more reliable internet by designing and building tools used by the best companies in the world.