Senior AI-Native DevOps / Operations Engineer (AMER)

valency.io

Software Engineering, Operations, Data Science

Berkeley, CA, USA

Posted on May 19, 2026

Location

Berkeley, CA

Employment Type

Full time

Location Type

Hybrid

Department

Engineering & Technology

About Valency

Valency Systems is a small, dynamic team of engineers, scientists, and researchers building the global hub for the agentic research era.

We're based in Berkeley, California, and we're building something that matters. If you care about open science, advancing research at the speed of thought, and using AI to accelerate discovery, we'd love to talk.

Our team is hybrid. We come together in person 3 days a week, with the option for 2 days of remote work.

The Position

We’re hiring an AI-native DevOps / Operations Engineer to help build and operate the platform behind Valency. This is not a narrow infrastructure maintenance role. We want builders who can design and harden production systems, improve CI/CD and release quality, raise reliability and response times, and create the observability, analytics, and guardrails needed to safely operate a rapidly evolving platform.

This role sits at the intersection of platform engineering, cloud infrastructure, production operations, and AI-era software delivery. You will help close the loop from agentically written software to reliable, performant systems in production. That means better tests, better release controls, stronger guardrails, richer production telemetry, clearer workflows for human approval, and tighter feedback into product and engineering.

This is an especially strong fit for someone who has helped scale high-growth SaaS systems, likes building from first principles, and wants to experience that kind of growth again in a new context.

What You'll Own

  • Design, build, and improve the production platform powering Valency

  • Tighten CI/CD processes so changes are tested, gated, observable, and safe to ship

  • Improve production reliability, latency, deployment safety, and incident response

  • Build the operational feedback loops that help engineering and product teams act on real production behavior

  • Establish the right logging, analytics, tracing, alerting, and workflow instrumentation as the platform scales

  • Define and implement guardrails for agent-involved software delivery and operations

  • Introduce human-in-the-loop approval flows where autonomy needs stronger controls

  • Improve cost efficiency across cloud infrastructure and platform operations

  • Help shape security, compliance, and auditability foundations for SOC 2, ISO 27001, and FedRAMP-oriented environments

  • Contribute to the long-term platform engineering direction as the team grows and specializes

As the senior engineer on-site, you will:

  • Own production operations and operational excellence for this function

  • Lead incident response expectations for the role

  • Establish the operating model the broader team will scale on

  • Work on-site in the SF Bay Area

What Success Looks Like

In the first 6–12 months, you will help Valency begin tracking and materially improve:

  • Deployment frequency and release confidence

  • Change failure rate and rollback quality

  • MTTR and incident handling

  • p95 / p99 latency and system responsiveness

  • Uptime and service reliability

  • Alert quality and signal-to-noise ratio

  • Infrastructure cost efficiency

  • Operational visibility into agent workflows and production behavior

  • Guardrail coverage for agent-authored or agent-assisted changes

What You'll Work With

Today the platform runs on AWS and adjacent infrastructure, including:

  • ECS / Fargate

  • EKS / container orchestration environments

  • RDS

  • S3

  • Cloudflare

  • CloudWatch

  • Queues, caches, schedulers, and batch / background processing systems

We currently use GitHub Actions and expect this person to help evolve that into a stronger long-term platform engineering and delivery foundation.

Our observability and analytics stack is still open for innovation. We want someone who is comfortable evaluating the tradeoffs and building the right system as complexity grows.

What Makes This Role AI-Native

This is not “DevOps, but with AI in the title.”

You will help build the operational system around software and workflows that increasingly involve agents. That includes:

  • Tracing workflows across agent-driven and human-driven systems

  • Developing production guardrails to keep systems from going off the rails

  • Designing approval paths for high-risk or high-impact actions

  • Turning production signals into actionable inputs for product and engineering

  • Helping close the loop between what the system is doing, how users experience it, and how the platform should evolve

We do not require prior experience operating AI-native systems at scale. We do require strong judgment, strong production systems experience, and a willingness to build the right AI-era operating model.

Responsibilities

  • Own and improve CI/CD pipelines, release controls, and deployment workflows

  • Build and maintain highly reliable AWS-based production systems

  • Improve observability across logs, metrics, traces, events, and workflow state

  • Instrument platform behavior so system issues, regressions, and slowdowns are quickly visible and actionable

  • Create operational analytics that help close the loop between engineering, product, and customer experience

  • Drive cost engineering and infrastructure efficiency as the system scales

  • Build safer operating patterns for agent-assisted code changes and operational actions

  • Implement testing, validation, approval, and rollback mechanisms that reduce operational risk

  • Improve batch, queue, cache, and job-processing reliability and monitoring

  • Support incident response, root cause analysis, postmortems, and follow-through

  • Work with external vendors and partners when needed

  • Help define platform standards, reliability practices, and operational maturity across the company

What We're Looking For

Required

  • 8+ years of progressively increasing responsibility operating important production systems

  • Demonstrated success shipping and running high-reliability systems in production

  • Deep AWS experience in real production environments

  • Strong background in software engineering and testing, not just infrastructure administration

  • Experience designing or significantly improving CI/CD systems and release processes

  • Experience building or operating logging, monitoring, alerting, and observability systems

  • Experience improving production reliability, performance, and operational response

  • Comfort with container-based systems and orchestration platforms

  • Strong hands-on ability in at least some of: Python, Go, Elixir, CDK

  • Strong judgment around guardrails, operational safety, and change management

  • Ability to work in ambiguity and build systems that do not yet fully exist

Strongly Preferred

  • Startup experience, especially in fast-scaling environments

  • Experience at high-scale SaaS companies that have gone through periods of rapid growth

  • Experience owning or materially influencing platform engineering functions

  • Experience with cost engineering / FinOps in AWS-heavy environments

  • Experience designing systems for compliance-oriented environments

  • Experience with SOC 2, ISO 27001, or FedRAMP-related operational requirements

  • Experience evaluating or implementing modern observability and workflow tracing stacks

  • Experience creating human-in-the-loop approval systems for sensitive production workflows

Why This Role

  • You will help define how an AI-native research platform is actually operated in production

  • You will work on systems that connect agents, researchers, product behavior, and infrastructure reality

  • You will have broad scope across infrastructure, reliability, analytics, and operational guardrails

  • You will help build the production foundation for a category-defining company at an early stage

  • You will not inherit a frozen stack; you will help choose and build the right one

Compensation, Benefits & Equity

We offer a competitive salary, benefits, and meaningful equity in a company building something important from the ground up.

Work Authorization: Candidates must be legally authorized to work in the United States.