

AI/ML Engineering Manager

ID: 9250

Type: Full-time

Category: Others

Company Name: Caylent

Location: Brazil

Education Level: Expert & Leadership (>10 years)

Job Description

Caylent is a cloud native services company that helps organizations bring the best out of their people and technology using Amazon Web Services (AWS). We provide a full range of AWS services including workload migrations and modernization, cloud native application development, DevOps, data engineering, security and compliance, and everything in between.

At Caylent, our people always come first. We are a global company and operate fully remote with employees in Canada, the United States, and Latin America. We celebrate the culture of each of our team members and foster a community of technological curiosity. Come talk to us to learn more about what it means to be a Caylien!

The Mission

This is a senior role for someone who leads from both directions at once — deeply technical on customer engagements, and fully accountable for the growth and performance of a team of ML engineers and architects. You will report to the Director of AI/ML.

You own hiring, development, and team health alongside leading complex customer engagements, shaping architecture, and driving pre-sales. Both parts of this job are real and ongoing. The right candidate will find energy in that combination, not tension.

Your Assignment

Leading Your Team
  • Hire and build: Set the technical bar for ML roles on your team, lead or oversee technical assessments, and make hiring decisions you can stand behind. Build a team that raises the practice's overall standard.
  • Develop people: Run regular structured 1:1s, provide candid feedback at meaningful milestones, and actively invest in each person's growth — whether they are early in their career or highly experienced.
  • Manage performance: Recognize strong contributors and address performance gaps directly and early. Partner with HRBPs and the Director of AI/ML when situations require a structured path, and advocate for your team when they deserve it.
  • Stay close to staffing: Understand how your team is utilized across engagements, keep the staffing team informed of each person's skills evolution and preferences, and ensure people are placed in work that stretches them appropriately.
Strategic Advisory
  • Lead ML assessments: Evaluate customer environments end-to-end — infrastructure, data pipelines, model lifecycle, and organizational readiness — and produce recommendations that drive executive decisions and open the door to the next engagement.
  • Shape architecture: Serve as the senior technical authority on engagements, setting architectural direction, ensuring technical quality across the team, and making the calls that matter when tradeoffs are hard.
  • Advise on ML operations: Help customers build ML systems they can actually own and sustain — translating MLOps, LLMOps, and production monitoring complexity into standards their engineering teams can execute and their leadership can act on.
  • Drive pre-sales: Partner with sales and solutions teams during scoping and proposal phases, contributing the technical depth needed to scope work accurately and give prospects confidence in Caylent's ability to deliver.
Hands-On Delivery
  • Lead engagements end-to-end: Drive architecture and solution design from kickoff through delivery — setting technical direction, unblocking the team on hard problems, and ensuring the work meets Caylent's quality standards.
  • Own the technical relationship: Depending on the engagement, you are either the primary client contact owning all architect-level outcomes, or the senior technical authority providing oversight across the team. The expectation is the same in both cases — you are the person the engagement depends on technically.
Growing the Practice
  • Raise the bar internally: Mentor engineers and architects through real work, contribute to technical interviews, and build reference architectures and accelerators that make the broader ML practice better.

Your Qualifications (non-negotiables)

  • 10+ years in machine learning or AI, with a proven track record of leading client-facing engagements in a consulting or advisory capacity.
  • Demonstrated people management experience — hiring, performance calibration, career development, and the ability to have difficult conversations directly and constructively.
  • Deep, current knowledge of the AWS ML and GenAI ecosystem, with the ability to make and defend architectural decisions across the full ML lifecycle — from data and feature engineering through training, deployment, and monitoring.
  • Deep expertise in at least two ML domains (such as classical ML, computer vision, NLP, or time series), combined with the judgment to assess, architect, and advise across the broader ML landscape.
  • Proven ability to architect and govern production ML systems end-to-end, translating MLOps, LLMOps, and broader AI operations complexity into standards that engineering teams can execute and executives can act on.
  • Deep expertise across foundation model adaptation — fine-tuning (LoRA, QLoRA, PEFT), alignment (RLHF, DPO), inference optimization, and distributed training — combined with RAG and agentic system design, including multi-agent architectures, MCP integration, and human-in-the-loop patterns on AWS.
  • Proven ability to operate independently in complex, ambiguous customer environments — navigating competing priorities, aligning stakeholders, and translating ML tradeoffs into business risk and value for both technical and executive audiences.
Strong differentiators
  • AWS Certified Machine Learning – Specialty and/or AWS Certified Solutions Architect – Professional.
  • Experience shaping practice-level standards, reference architectures, and reusable ML accelerators across multiple engagements.
  • Exposure to varied industries and problem types in a consulting or client-facing context.
  • Deep fluency in responsible AI practices — model evaluation, bias detection, fairness frameworks, and AI governance — applied in enterprise deployments.
  • Fluency in AIOps patterns — designing agentic workflows for anomaly detection, automated root cause analysis, and remediation across observability platforms — and the ability to translate AI operations outcomes into measurable business value for customers.

Technical Stack

Our practice spans a broad range of ML domains. Candidates are expected to prescribe solutions, not just recognize them, with the judgment to maximize what AWS makes possible and the experience to know how open-source tooling strengthens it.

  • ML Domains: Classical ML, Computer Vision, NLP, Generative AI & LLMs, AI Agents & Autonomous Systems, Intelligent Document Processing, Video Understanding, Speech & Audio, Time Series & Forecasting, Recommender Systems, Graph ML, Reinforcement Learning, Multimodal AI
  • AWS ML Platform: SageMaker, SageMaker Pipelines, SageMaker Feature Store, SageMaker Model Registry, SageMaker Clarify, Bedrock (Agents, Knowledge Bases, Guardrails, AgentCore, Model Evaluation)
  • Multi-provider LLM: Bedrock, Anthropic API, OpenAI API, Google Gemini API, Azure OpenAI — with the judgment to reason across provider tradeoffs in enterprise contexts
  • AWS AI Services: Rekognition, Comprehend, Transcribe, Textract, Translate, Personalize, Neptune, Kinesis Video Streams, Polly
  • Data Platform: Apache Spark / PySpark, Apache Kafka, Amazon Kinesis, Apache Iceberg, Delta Lake, Apache Hudi, AWS Glue
  • Vector Databases: Pinecone, pgvector, Amazon OpenSearch (vector), Weaviate
  • Frameworks: PyTorch, TensorFlow, JAX, Scikit-learn, XGBoost, HuggingFace (Transformers, PEFT, TRL), LangChain, LlamaIndex, DSPy, Ollama
  • MLOps & Governance: MLflow, W&B, Airflow / MWAA (data orchestration), Dagster (asset-based pipelines), Kubeflow Pipelines, CI/CD, IaC (CloudFormation, CDK, Terraform), Docker, Kubernetes, ML Governance (lineage, data contracts, audit), Responsible AI / Bias & Fairness
  • LLM Evaluation & Safety: RAGAS, LLM-as-judge patterns, DeepEval, NeMo Guardrails, Constitutional AI patterns, structured output validation
  • Inference & Optimization: Triton, vLLM, SGLang, Trainium, Inferentia, Quantization (GPTQ, AWQ, bitsandbytes), SageMaker Neo

Benefits

  • 100% remote work
  • Flexible Time Off
  • Competitive phantom equity
  • Paid exams and certifications
  • Peer bonus awards
  • State-of-the-art laptop and tools
  • Equipment & Office Stipend
  • Individual professional development plan
  • Annual stipend for Learning and Development
  • Work with an amazing worldwide team in an incredible corporate culture

This role may require up to 25% travel, depending on business needs.

NOTE: We’re unable to provide visa sponsorship now or at any time in the future.

At Caylent, we are committed to fair, transparent, and inclusive hiring practices. As part of our recruitment process, we may use artificial intelligence (AI) tools or automated systems to assist with the screening and evaluation of applications to help match candidate qualifications with job requirements.
These tools are designed to support — not replace — human decision-making. Final hiring decisions are always made by our trained recruitment professionals.
If an AI or automated tool is used during your application process, it will only be in accordance with applicable laws and regulations, and your information will be handled in a secure and confidential manner.
If you have any questions, please contact talent@caylent.com.

Caylent is a place where everyone belongs. We celebrate diversity and are committed to creating an inclusive environment for all employees. Our approach helps us to build a winning team that represents a variety of backgrounds, perspectives, and abilities. So, regardless of how your diversity expresses itself, you can find a home here at Caylent.  

We are proud to be an equal opportunity employer. We prohibit discrimination and harassment of any kind based on race, color, religion, national origin, sex (including pregnancy), sexual orientation, gender identity, gender expression, age, veteran status, genetic information, disability, or other applicable legally protected characteristics. If you would like to request an accommodation due to a disability, please contact us at hr@caylent.com.
Company Information

Company Name: Caylent

Company Website: https://www.caylent.com

Company Address: San Diego, CA, USA

Caylent is a technology services firm that provides cloud-native engineering, managed cloud operations, and professional services to help organizations design, build, migrate, and operate applications and infrastructure in public cloud environments. The company concentrates on practical, outcome-driven engagements that combine cloud architecture, DevOps/Site Reliability Engineering (SRE) practices, automation, and platform engineering to accelerate software delivery while improving reliability, security, and cost efficiency.

Core business activities
  • Cloud architecture and migration: Caylent helps customers assess existing on-premises or legacy environments, plan cloud migration strategies, and implement cloud-native architectures. Typical work includes lift-and-shift and replatforming efforts as well as deeper refactoring to enable microservices, containers, and serverless patterns where appropriate. The company emphasizes automated, repeatable approaches to migration to reduce risk and improve predictability.
  • Containerization and Kubernetes: A key area of focus is containerization and orchestration using Kubernetes and related tooling. Caylent assists with container strategy, cluster design and provisioning, workload migration to Kubernetes (including Amazon EKS and other managed Kubernetes offerings), and operational runbooks for day-to-day management.
  • Infrastructure as Code and automation: Caylent implements infrastructure as code (IaC) to create reproducible, version-controlled infrastructure deployments. Work commonly involves tools and frameworks for declarative provisioning and configuration management, CI/CD pipeline integration, automated testing, and GitOps-style delivery patterns to ensure infrastructure and application changes are auditable and repeatable.
  • Managed cloud services and SRE: Beyond project work, Caylent provides managed services to operate customer cloud environments. This includes 24/7 operational support, incident response, monitoring and observability configuration, performance tuning, and ongoing cost and security optimization. Caylent applies SRE principles to reduce toil and drive service reliability through automation and error-budget-based priorities.
  • DevOps, CI/CD and developer enablement: The company works with engineering teams to modernize and automate software delivery lifecycles: setting up CI/CD pipelines, artifact management, automated testing, and release orchestration. The aim is to improve developer productivity and accelerate time-to-market while maintaining stability and compliance requirements.
  • Security, compliance and governance: Caylent helps embed security and governance controls into cloud architectures and delivery workflows. Typical engagements include threat modeling, securing CI/CD pipelines, policy-as-code and guardrails, identity and access management design, and support for industry-relevant compliance needs and auditing practices.
  • Observability, monitoring and cost optimization: Caylent configures observability stacks, logging, metrics and tracing to provide actionable visibility into application and infrastructure behavior. They also perform cloud cost analysis and implement cost-optimization strategies to help organizations manage and reduce cloud spend.

Main products and services

Caylent’s offerings are organized around professional services engagements and ongoing managed service arrangements. Professional services include assessments, cloud migration programs, application modernization projects, platform engineering engagements, and implementation of CI/CD and infrastructure automation. Managed services typically cover continuous operations of cloud environments, SRE-driven support, incident management, monitoring and alerts, patching and lifecycle management, and proactive optimization.

Engagement model and delivery approach

Caylent usually engages in a consultative model, starting with discovery and assessment to define desired outcomes, followed by a prioritized plan that blends short-term wins with longer-term platform and process improvements. Project teams are typically cross-disciplinary (combining cloud architects, DevOps/SRE engineers, security specialists, and project managers) to deliver end-to-end solutions. The company emphasizes knowledge transfer and developer enablement so client teams can adopt and extend new platforms and practices.

Technology emphases

In executing projects, Caylent routinely works with public cloud providers and open-source cloud-native technologies. Typical technical areas include Amazon Web Services (AWS) and managed services such as EKS for Kubernetes, Microsoft Azure and Google Cloud Platform where appropriate, containerization technologies (Docker, Kubernetes), IaC tools (Terraform and similar), CI/CD systems, observability tooling (metrics, logs and tracing platforms), and security and policy tools that integrate into automated delivery pipelines.

Typical customers and outcomes

Caylent serves a range of customers from growth-stage startups to enterprise teams seeking to modernize legacy systems or improve cloud operations. Common outcomes reported from engagements include accelerated delivery cycles, reduced operational overhead through automation, improved system reliability and observability, clearer cost visibility and lower cloud spend, and enhanced security posture through automated controls.

Public positioning and market role

Caylent positions itself as a partner for organizations that need both hands-on engineering execution and long-term operational capabilities in the cloud. By combining professional services with managed operations and platform engineering, Caylent aims to help customers move faster while maintaining robust operational practices. The company communicates a focus on measurable results (such as improved deployment frequency, reduced mean time to recovery, and lower operational cost) by applying cloud-native patterns and automation.

Note: The foregoing description focuses on Caylent’s publicly described capabilities in cloud-native engineering, DevOps/SRE, and managed cloud services. For the most current, specific service names, partner certifications, office locations, or client case studies, consult Caylent’s official site and materials.