Support Engineer

We're hiring a support engineer to keep Grovli running reliably for the paying users who depend on it. This is a hybrid role: you'll own production health, drive incident response, and partner with the engineering team on user-reported bugs that slip past automated tests.

About Grovli

Grovli is an AI meal-planning product on iOS and the web. The stack runs on Google Cloud: eight Cloud Run services, MongoDB, Redis, a Firestore vector DB, Vertex AI for Gemini and Imagen, and observability via Grafana / Loki / Prometheus / Tempo. We operate as a small team where the difference between "this works" and "users notice latency" gets caught by a person reading dashboards, not by a Tier-3 escalation chain.

What you'll do

Own production observability: Grafana dashboards for backend latency, AI-pipeline cost, meal-generation success rates, and sync fastpath hit rate. When a metric drifts, you find the root cause, usually via Tempo traces, Loki logs, or direct Mongo aggregations.
Triage user-reported issues: paying users hit edge cases that the gauntlet (our automated integration test suite) doesn't cover. You reproduce and scope the issue, then either patch it yourself (for config or data fixes) or hand engineering a self-contained bug report with the failing trace.
Drive incident response: when the meal generator drifts, the matcher under-hits, or a wearable integration breaks, you own the incident end to end. You declare it, scope the blast radius, coordinate the fix, and write the postmortem.
Run health checks for adjacent systems: Garmin / WHOOP / Withings OAuth health, Vertex AI quota burndown, MongoDB index pressure, and Redis memory headroom. Catch problems before users do.
Improve the gauntlet: when an incident reveals a coverage gap, you write the integration test that would have caught it.

You probably have

2+ years operating a real production service (Cloud Run, GKE, EC2, Heroku; any of these is fine, because what matters is that you've debugged problems live)
Comfort with logs, traces, and metrics: you can navigate a Grafana dashboard or write a LogQL query against Loki without a tutorial
A debugging instinct: you reach for git log, gcloud logging, and Mongo aggregations before you reach for guesses
Patience with frustrated users, and clarity when writing back to them

Bonus points

Python or TypeScript reading-level fluency (you don't have to write features, but you should be able to follow a stack trace into source)
Experience with OpenTelemetry, Grafana, Loki, or similar
A knack for writing clear incident postmortems

How to apply

Email info@citigrove.com with the subject "Support Engineer". Include your resume and a paragraph describing a production incident you led. Bonus: share a postmortem you wrote, public or scrubbed, that you're proud of.