Senior/Staff DevOps HPC Engineer
Your work will change lives. Including your own.
The Impact You’ll Make
Recursion is revolutionizing the field of drug discovery by integrating Science and Machine Learning, and we are looking for a Senior/Staff DevOps HPC Engineer to join our pioneering team.
You will play a crucial role in developing and maintaining our HPC systems that power our cutting-edge drug discovery research. You will be responsible for designing, implementing, and managing the infrastructure that supports our machine learning and scientific computing workloads.
Your day-to-day tasks will include building robust and scalable infrastructure, deploying and managing HPC resources, and automating operational processes. You'll apply your deep understanding of DevOps principles and HPC systems to solve complex computational challenges. This means you'll be actively involved in executing high-level computational strategies, tracking crucial processing information, and ensuring high data integrity.
Furthermore, you will collaborate with a diverse team of scientists, machine learning experts, and other engineers to develop a world-class data platform that facilitates the generation and management of petabytes of data, enabling the rapid deployment of new deep learning models into the production data pipeline.
Your contributions will directly impact the efficiency and effectiveness of our drug discovery efforts. You can expect to work on multiple projects at the same time in a fast-paced and stimulating environment.
Your responsibilities will go beyond maintaining systems and infrastructure to include proactive troubleshooting, routine system maintenance, ensuring the security of our computing environment, and creating detailed documentation for all processes and procedures. Join us and make a significant impact on the future of drug discovery. In this role:
- You’ll design, implement, maintain, and optimize our scientific compute, network, and data storage infrastructure and services using an Infrastructure as Code approach across both on-premises and public cloud environments.
- Your technical expertise and leadership will drive innovation across all layers of the HPC/AI infrastructure, ensuring that we provide an effective, scalable platform to support our dynamic scientific workloads.
- By developing scripts and workflows, you'll automate and verify infrastructure provisioning, dynamic reconfiguration, and various repetitive tasks, enhancing our support of the HPC environments. Your attention to detail will be critical in performance analysis, benchmarking, and tuning of our systems and applications.
- Your troubleshooting skills will be invaluable as you resolve application, system, and other technical problems, alongside addressing user tickets swiftly.
- You'll research, deploy, and optimize workload and resource scheduling, security, and data lifecycle management policies.
- You will regularly assess the health and operational performance of the platform against established metrics, with a view to meeting and improving its operational service targets.
- Lastly, as a lead in technical communication and collaboration with our customers, your efforts will ensure a high level of customer satisfaction. It's your opportunity to make a significant impact in our organization and the wider scientific community.
This position is based at our headquarters in Salt Lake City, Utah, or our office in Toronto, Canada; however, we will consider remote work for this position. We ask that remote employees commit to regular on-site visits for routine work and departmental events.
The Team You’ll Join
As a Senior/Staff DevOps HPC Engineer, you will be a part of our dedicated HPC Engineering team, reporting directly to the Associate Director. This dynamic team includes two experienced Senior Engineers, and with the addition of two new roles, including this position, you'll be part of an empowered, cross-functional unit.
Our HPC team works in a fast-paced, collaborative environment, handling a broad spectrum of computational projects. These range from developing advanced, scalable infrastructure to deploying and managing HPC resources and automating operational processes. The team also plays a crucial role in the curation of our vast data platform, which caters to a diverse set of professionals, including biologists, data scientists, and automation engineers.
The HPC team is constantly pushing the boundaries in the field of supercomputing in the TechBio industry. As part of this team, you will collaborate on projects that streamline and optimize our machine learning workflows and scientific computing tasks, driving efficient and transformative solutions within the company. This is a unique opportunity to join a team that thrives on innovation, collaboration, and inclusivity in a role that is pivotal to our mission.
The Experience You’ll Need
- A minimum of 10 years of experience working with HPC infrastructure, preferably in global BioPharma organizations.
- Solid experience with software-defined infrastructure and cloud computing platforms such as Kubernetes, GCP, and AWS.
- Extensive experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments.
- In-depth hands-on experience with the provisioning, configuration, and management of infrastructure through Infrastructure as Code (IaC) and cloud automation principles.
- Python programming and bash scripting experience.
- Proficiency with source control, continuous integration, configuration management, monitoring, and systems tools.
- Practical knowledge of resource management and job scheduling using Slurm and Kubernetes.
- Experience with RDMA-capable high-speed networking.
- Familiarity with parallel file systems and multi-tier file and object storage.
- Proficiency in container technologies, including Apptainer and Docker.
- Experience in building, installing, and supporting user-requested software.
- Strong verbal and written skills for effective communication and documentation.
- Prior experience mentoring, guiding, and cross-training team members.
How You’ll Be Supported
- The onboarding process will include peer knowledge transfer sessions, introductions to key stakeholders, and comprehensive exposure to our company culture and processes.
- You'll have the chance to learn from your colleagues during our regular lunch & learn and tech talk sessions.
- We offer the opportunity to attend courses for certification in new skills or technologies relevant to your role.
- If you're keen to hone your leadership skills, you'll have the option to participate in coaching programs such as BetterUp.
- To ensure you're always at the forefront of your field, we offer the opportunity to attend conferences.
The Values That We Hope You Share:
- We Care: We care about our drug candidates, our Recursionauts, their families, each other, our communities, the patients we aim to serve and their loved ones. We also care about our work.
- We Learn: Learning from the diverse perspectives of our fellow Recursionauts, and from failure, is an essential part of how we make progress.
- We Deliver: We are unapologetic that our expectations for delivery are extraordinarily high. There is urgency to our existence: we sprint at maximum engagement, making time and space to recover.
- Act Boldly with Integrity: No company changes the world or reinvents an industry without being bold. That boldness must be balanced, not by timidity, but by doing the right thing even when no one is looking.
- We are One Recursion: We operate with a 'company first, team second' mentality. Our success comes from working as one interdisciplinary team.
Recursion spends time and energy connecting every aspect of work to these values. They aren’t static, but regularly discussed and questioned because we make decisions rooted in those values in our day-to-day work. You can read more about our values and how we live them every day here.
More About Recursion
Recursion is a clinical-stage TechBio company leading the space by decoding biology to industrialize drug discovery. Enabling its mission is the Recursion OS, a platform built across diverse technologies that continuously expands one of the world’s largest proprietary biological and chemical datasets. Recursion leverages sophisticated machine-learning algorithms to distill from its dataset a collection of trillions of searchable relationships across biology and chemistry unconstrained by human bias. By commanding massive experimental scale (up to millions of wet lab experiments weekly) and massive computational scale (owning and operating one of the most powerful supercomputers in the world), Recursion is uniting technology, biology and chemistry to advance the future of medicine.
Recursion is headquartered in Salt Lake City, where it is a founding member of BioHive, the Utah life sciences industry collective. Recursion also has offices in Toronto, Montreal and the San Francisco Bay Area. Learn more at www.recursion.com, or connect on Twitter and LinkedIn.
Recursion is an Equal Opportunity Employer that values diversity and inclusion. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other characteristic protected under applicable federal, state, local, or provincial human rights legislation.