Support Engineer
Operate the Grovli production stack — keep the AI pipeline, mobile API, and web frontend healthy for paying users.
We're hiring a support engineer to keep Grovli running reliably for the paying users who depend on it. This is a hybrid role: you'll own production health, drive incident response, and partner with the engineering team on user-reported bugs that slip past automated tests.
About Grovli
Grovli is an AI meal-planning product on iOS and the web. The stack runs on Google Cloud — 8 Cloud Run services, MongoDB, Redis, Firestore vector DB, Vertex AI for Gemini + Imagen, observability via Grafana / Loki / Prometheus / Tempo. We operate as a small team where the difference between "this works" and "users notice latency" gets caught by a person reading dashboards, not a Tier-3 escalation chain.
What you'll do
- Own production observability: Grafana dashboards for backend latency, AI-pipeline cost, meal-generation success rates, sync fastpath hit rate. When a metric drifts, you find the root cause — usually via Tempo traces, Loki logs, or direct Mongo aggregations.
- Triage user-reported issues: paying users hit edge cases that the gauntlet (our automated test suite) doesn't cover. You reproduce, scope, and either patch (for config / data fixes) or hand a self-contained bug report to engineering with the failing trace.
- Drive incident response: when the meal generator drifts, the matcher under-hits, or a wearable integration breaks, you're the person who takes ownership end-to-end — declare the incident, scope blast radius, coordinate the fix, write the postmortem.
- Run health-checks for adjacent systems: Garmin / WHOOP / Withings OAuth health, Vertex AI quota burndown, MongoDB index pressure, Redis memory headroom. Catch problems before users do.
- Improve the gauntlet: when an incident reveals a coverage gap, you write the integration test that would have caught it.
You probably have
- 2+ years operating a real production service. Cloud Run, GKE, EC2, or Heroku are all fine; what matters is that you've debugged problems live.
- Comfort with logs, traces, metrics — you can navigate a Grafana dashboard or write a Loki LogQL query without a tutorial
- A debugging instinct: you reach for git log, gcloud logging, and Mongo aggregations before you reach for guesses
- Patience with users who are frustrated and clarity in writing back to them
Bonus points
- Python or TypeScript reading-level fluency (you don't have to write features but you should be able to follow a stack trace into source)
- Experience with OpenTelemetry, Grafana, Loki, or similar
- A knack for writing clear incident postmortems
How to apply
Send a resume + a paragraph describing a production incident you led to info@citigrove.com with the subject "Support Engineer". Bonus: share a postmortem you wrote — public or scrubbed — that you're proud of.