Senior / Principal Site Reliability Engineer

DataCrunch
Full-time
Remote
United States
Information Technology

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.

We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high morale in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just a job - we offer a career-defining opportunity to be part of building something big!

As a cherry on top, we’ve recently raised $64M in Series A and are ready to reach new heights.

About the role

We’re seeking a Senior or Principal Site Reliability Engineer (SRE) to become our first U.S. hire, based in the Bay Area. This is a pivotal role as we expand our operations across the West Coast. You’ll work closely with our European engineering teams to scale our high-performance compute (HPC) and cloud infrastructure globally. As our initial U.S.-based engineer, you’ll set the standard for reliability, automation, and operational excellence.

Why DataCrunch
  • Generous cash + equity compensation along with various fringe benefits (e.g., healthcare, lunch, wellbeing).
  • Profitable operations, in addition to fast growth.
  • Role that offers plenty of space to both make a business-critical impact and become a QA team lead or an engineer.
  • Small yet mighty team of 65, challenging the status quo to positively impact the lives of many people.
  • 27 nationalities in total, with 6 different ones in the management team.
Practicalities
  • Work mode: Remote (with plans to open our first U.S. office next year)
  • Seniority level: Senior or Principal
  • Employment type: Full-time, permanent
Your responsibilities
  • Ensure the reliability, scalability, and performance of HPC and cloud systems.

  • Build and maintain automation, observability, and monitoring frameworks for compute clusters.

  • Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.

  • Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.

  • Participate in architecture design and long-term infrastructure strategy discussions.

  • Help establish local infrastructure and contribute to the setup of our future San Francisco office.

  • Play a key role in recruiting and mentoring as our U.S. team grows.

Your key competencies
  • 7+ years in SRE, DevOps, or Infrastructure Engineering—preferably in HPC or large-scale distributed systems.

  • Linux expertise (Ubuntu or Debian preferred).

  • Strong experience with scripting and automation (Python, Go, Bash).

  • Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).

  • Deep understanding of networking (DNS, TCP) and infrastructure-as-code tools (Terraform, Ansible).

  • Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.

  • Familiarity with ML model training environments.

  • Understanding of Kubernetes (nice to have).

What the process looks like


  1. Intro chat with our Talent Acquisition Partner - an initial online conversation to learn more about you and share details about the role.

  2. Technical assignment - a short task (around 15 minutes) to understand your approach and problem-solving style.

  3. Online technical interview with the Hiring Manager - a deeper discussion about your technical experience and ways of working.

  4. In-person interview with one of our team members - a chance to get to know the team and our culture.

  5. Final interview with our CTO & CEO - to align on vision and expectations.
