Senior AI-Native DevOps / Operations Engineer (AMER)
valency.io
Software Engineering, Operations, Data Science
Berkeley, CA, USA
Location: Berkeley, CA
Employment Type: Full time
Location Type: Hybrid
Department: Engineering & Technology
About Valency
Valency Systems is a small, dynamic team of engineers, scientists, and researchers building the global hub for the agentic research era.
We're based in Berkeley, California, and we're building something that matters. If you care about open science, advancing research at the speed of thought, and using AI to accelerate discovery, we'd love to talk.
Our team is hybrid. We come together in person 3 days a week, with the option for 2 days of remote work.
The Position
We’re hiring an AI-native DevOps / Operations Engineer to help build and operate the platform behind Valency. This is not a narrow infrastructure maintenance role. We want builders who can design and harden production systems, improve CI/CD and release quality, improve reliability and response times, and create the observability, analytics, and guardrails needed to safely operate a rapidly evolving platform.
This role sits at the intersection of platform engineering, cloud infrastructure, production operations, and AI-era software delivery. You will help close the loop from agentically written software to reliable, performant systems in production. That means better tests, better release controls, stronger guardrails, richer production telemetry, clearer workflows for human approval, and tighter feedback into product and engineering.
This is an especially strong fit for someone who has helped scale high-growth SaaS systems, likes building from first principles, and wants to experience that kind of growth again in a new context.
What You'll Own
Design, build, and improve the production platform powering Valency
Tighten CI/CD processes so changes are tested, gated, observable, and safe to ship
Improve production reliability, latency, deployment safety, and incident response
Build the operational feedback loops that help engineering and product teams act on real production behavior
Establish the right logging, analytics, tracing, alerting, and workflow instrumentation as the platform scales
Define and implement guardrails for agent-involved software delivery and operations
Introduce human-in-the-loop approval flows where autonomy needs stronger controls
Improve cost efficiency across cloud infrastructure and platform operations
Help shape security, compliance, and auditability foundations for SOC 2, ISO 27001, and FedRAMP-oriented environments
Contribute to the long-term platform engineering direction as the team grows and specializes
As the senior engineer on-site, you will:
Own production operations and operational excellence for this function
Lead incident response and define the expectations for it within this role
Establish the operating model the broader team will scale on
Work on-site in the SF Bay Area
What Success Looks Like
In the first 6–12 months, you will help Valency begin tracking, and then materially improve:
Deployment frequency and release confidence
Change failure rate and rollback quality
MTTR and incident handling
p95 / p99 latency and system responsiveness
Uptime and service reliability
Alert quality and signal-to-noise ratio
Infrastructure cost efficiency
Operational visibility into agent workflows and production behavior
Guardrail coverage for agent-authored or agent-assisted changes
What You'll Work With
Today the platform runs on AWS and adjacent infrastructure, including:
ECS / Fargate
EKS / container orchestration environments
RDS
S3
Cloudflare
CloudWatch
Queues, caches, schedulers, and batch / background processing systems
We currently use GitHub Actions and expect you to help evolve it into a stronger long-term platform engineering and delivery foundation.
Our observability and analytics stack is still open for innovation. We want someone who is comfortable evaluating the tradeoffs and building the right system as complexity grows.
What Makes This Role AI-Native
This is not “DevOps, but with AI in the title.”
You will help build the operational system around software and workflows that increasingly involve agents. That includes:
Tracing workflows across agent-driven and human-driven systems
Developing production guardrails to keep systems from going off the rails
Designing approval paths for high-risk or high-impact actions
Turning production signals into actionable inputs for product and engineering
Helping close the loop between what the system is doing, how users experience it, and how the platform should evolve
We do not require prior experience operating AI-native systems at scale. We do require strong judgment, strong production systems experience, and a willingness to build the right AI-era operating model.
Responsibilities
Own and improve CI/CD pipelines, release controls, and deployment workflows
Build and maintain highly reliable AWS-based production systems
Improve observability across logs, metrics, traces, events, and workflow state
Instrument platform behavior so system issues, regressions, and slowdowns are quickly visible and actionable
Create operational analytics that help close the loop between engineering, product, and customer experience
Drive cost engineering and infrastructure efficiency as the system scales
Build safer operating patterns for agent-assisted code changes and operational actions
Implement testing, validation, approval, and rollback mechanisms that reduce operational risk
Improve batch, queue, cache, and job-processing reliability and monitoring
Support incident response, root cause analysis, postmortems, and follow-through
Work with external vendors and partners when needed
Help define platform standards, reliability practices, and operational maturity across the company
What We're Looking For
Required
8+ years operating important production systems, with progressively increasing responsibility
Demonstrated success shipping and running high-reliability systems in production
Deep AWS experience in real production environments
Strong background in software engineering and testing, not just infrastructure administration
Experience designing or significantly improving CI/CD systems and release processes
Experience building or operating logging, monitoring, alerting, and observability systems
Experience improving production reliability, performance, and operational response
Comfort with container-based systems and orchestration platforms
Strong hands-on ability in at least some of the following: Python, Go, Elixir, AWS CDK
Strong judgment around guardrails, operational safety, and change management
Ability to work in ambiguity and build systems that do not yet fully exist
Strongly Preferred
Startup experience, especially in fast-scaling environments
Experience at high-scale SaaS companies that have gone through periods of rapid growth
Experience owning or materially influencing platform engineering functions
Experience with cost engineering / FinOps in AWS-heavy environments
Experience designing systems for compliance-oriented environments
Experience with SOC 2, ISO 27001, or FedRAMP-related operational requirements
Experience evaluating or implementing modern observability and workflow tracing stacks
Experience creating human-in-the-loop approval systems for sensitive production workflows
Why This Role
You will help define how an AI-native research platform is actually operated in production
You will work on systems that connect agents, researchers, product behavior, and infrastructure reality
You will have broad scope across infrastructure, reliability, analytics, and operational guardrails
You will help build the production foundation for a category-defining company at an early stage
You will not inherit a frozen stack; you will help choose and build the right one
Compensation, Benefits & Equity
We offer a competitive salary, benefits, and meaningful equity in a company building something important from the ground up.
Work Authorization: Candidates must be legally authorized to work in the United States.