Senior / Principal Site Reliability Engineer

DataCrunch

Full-time

Remote

United States

Information Technology

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.

We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high morale in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just the job - we offer a career-defining opportunity to be part of building something big!

As a cherry on top, we’ve recently raised $64M in Series A and are ready to reach new heights.

About the role

We’re seeking a Senior or Principal Site Reliability Engineer (SRE) to become our first U.S. hire, based in the Bay Area. This is a pivotal role as we expand our operations across the West Coast. You’ll work closely with our European engineering teams to scale our high-performance compute (HPC) and cloud infrastructure globally. As our initial U.S.-based engineer, you’ll set the standard for reliability, automation, and operational excellence.

Why DataCrunch

Generous cash and equity compensation, plus fringe benefits such as healthcare, lunch, and wellbeing.
Profitable operations, in addition to fast growth.
A role with plenty of room both to make a business-critical impact and to grow, whether as a team lead or as an engineer.
Small yet mighty team of 65, challenging the status quo to positively impact the lives of many people.
27 nationalities in total, with 6 different ones in the management team.

Practicalities

Work mode: Remote (with plans to open our first U.S. office next year)
Seniority level: Senior or Principal
Employment type: Full-time, permanent

Your responsibilities

Ensure the reliability, scalability, and performance of HPC and cloud systems.
Build and maintain automation, observability, and monitoring frameworks for compute clusters.
Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
Participate in architecture design and long-term infrastructure strategy discussions.
Help establish local infrastructure and contribute to the setup of our future San Francisco office.
Play a key role in recruiting and mentoring as our U.S. team grows.

Your key competencies

7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems.
Linux expertise (Ubuntu or Debian preferred).
Strong experience with scripting and automation (Python, Go, Bash).
Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
Deep understanding of networking (DNS, TCP) and infrastructure-as-code tools (Terraform, Ansible).
Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.
Familiarity with ML model training environments.
Understanding of Kubernetes (nice to have).

What the process looks like

Intro chat with our Talent Acquisition Partner - an initial online conversation to learn more about you and share details about the role.
Technical assignment - a short task (around 15 minutes) to understand your approach and problem-solving style.
Online technical interview with the Hiring Manager - a deeper discussion about your technical experience and ways of working.
In-person interview with one of our team members - a chance to get to know the team and our culture.
Final interview with our CTO & CEO - to align on vision and expectations.
