Tata Consultancy Services Ltd

Agentic AI Workflow for Infra Operations

August 2025 - Present

Lead AI Engineer - Agentic AI Workflow for Infra Operations (ITIS)

Designed and implemented a complete Agentic-AI automation ecosystem for infrastructure operations, spanning analytics, reporting, incident intelligence, and governance workflows.

70-80%

Manual effort reduction

4 Major Workflows

Automation coverage

Daily / Weekly / Monthly

Reporting cadence

Idempotent + Locked

Execution reliability

Production AI Agents Delivered

Heatmap Analysis AI Agent

  • Built end-to-end on LangGraph using reusable AI nodes and decision tools.
  • Automated recurring issue detection, behavior patterns, and server-level trend analysis.
  • Used schema-based prompts to enforce deterministic JSON outputs.
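A minimal sketch of how schema enforcement on a model reply might look, using only the standard library. The field names in `FINDING_SCHEMA` and the function `parse_finding` are hypothetical illustrations, not the project's actual schema or code.

```python
import json

# Hypothetical schema: field name -> required Python type.
FINDING_SCHEMA = {"server": str, "issue": str, "occurrences": int, "recurring": bool}


def parse_finding(raw: str) -> dict:
    """Parse a model reply and reject anything that does not match the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = FINDING_SCHEMA.keys() - data.keys()
    extra = data.keys() - FINDING_SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in FINDING_SCHEMA.items():
        # bool is a subclass of int, so compare exact types rather than isinstance
        if type(data[field]) is not expected:
            raise ValueError(f"{field} must be {expected.__name__}")
    return data
```

Rejecting unknown keys as well as missing ones keeps downstream nodes from silently depending on fields the prompt never promised.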

Patching Analysis AI Agent

  • Validated iPatch data and computed patch compliance KPIs.
  • Correlated patch gaps with incidents to expose operational risk clusters.
  • Built Streamlit UI with Plotly, Matplotlib, and Seaborn visualizations.
  • Delivered automated PDF reports via FPDF with exactly-once generation controls.
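As an illustration of the KPI and correlation steps above, here is a small sketch. It assumes patch status is available as a server-to-boolean mapping and incidents as a list of server names; the real iPatch data layout and KPI definitions are not described here, so both functions are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def compliance_rate(patched: Dict[str, bool]) -> float:
    """Share of servers marked patched, as a percentage (0.0 for empty input)."""
    if not patched:
        return 0.0
    return 100.0 * sum(patched.values()) / len(patched)


def incidents_on_unpatched(
    patched: Dict[str, bool], incident_servers: Iterable[str]
) -> List[Tuple[str, int]]:
    """Unpatched servers ranked by incident count, highest first."""
    # Servers missing from the patch data are ignored rather than assumed unpatched.
    counts = Counter(s for s in incident_servers if patched.get(s) is False)
    return counts.most_common()
```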

Incident Analysis AI Agent

  • Enabled contextual retrieval of historical incident resolutions for faster RCA.
  • Implemented governed retrieval workflows for secure customer data access.
  • Automated knowledge discovery across multiple infrastructure towers.

Drive Cleanup Agent

  • Automated filesystem incident classification through rule-based and AI-assisted logic.
  • Standardized ticket assignment and triage patterns across operations teams.

Architecture and Orchestration Engineering

LangGraph Decision Orchestration

  • Designed directed graph workflows with branching, fallback, and decision nodes.
  • Enforced deterministic behavior through schema-controlled tool responses.
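The branching and fallback pattern can be sketched in plain Python. This is not LangGraph's API; it is a hand-rolled stand-in that shows the control flow only, and the node names and the `severity` field are hypothetical.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]


def classify(state: State) -> State:
    # Decision node: choose a route, defaulting to the fallback branch.
    severity = state.get("severity")
    state["route"] = severity if severity in ("high", "low") else "fallback"
    return state


def escalate(state: State) -> State:
    state["action"] = "page_on_call"
    return state


def log_only(state: State) -> State:
    state["action"] = "log"
    return state


def manual_review(state: State) -> State:
    state["action"] = "manual_review"
    return state


NODES: Dict[str, Callable[[State], State]] = {
    "classify": classify, "high": escalate, "low": log_only, "fallback": manual_review,
}
# Each router returns the next node name, or None to stop.
ROUTERS: Dict[str, Callable[[State], Optional[str]]] = {
    "classify": lambda s: str(s["route"]),
    "high": lambda s: None,
    "low": lambda s: None,
    "fallback": lambda s: None,
}


def run(state: State, start: str = "classify", max_steps: int = 10) -> State:
    current: Optional[str] = start
    for _ in range(max_steps):  # step cap guards against accidental cycles
        state = NODES[current](state)
        current = ROUTERS[current](state)
        if current is None:
            return state
    raise RuntimeError("graph did not terminate")
```

Because every unrecognized input lands on the fallback node, the same input always produces the same path through the graph.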

Token and Rate-Limit Resilience

  • Built chunking pipelines for report payloads exceeding 128K tokens.
  • Implemented retry, timeout, and sequential processing controls to avoid overlap failures.
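A rough sketch of chunking plus sequential retries, under stated assumptions: token counts are approximated by whitespace splitting (a real tokenizer would differ), `RuntimeError` stands in for a rate-limit error, and all function names are hypothetical.

```python
import time
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count as a cheap proxy for model tokens.
    return len(text.split())


def chunk_by_budget(lines: List[str], max_tokens: int) -> List[List[str]]:
    """Group lines into chunks whose approximate token total stays within budget."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for line in lines:
        cost = approx_tokens(line)
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(line)  # a single oversize line becomes its own chunk
        used += cost
    if current:
        chunks.append(current)
    return chunks


def call_with_retry(fn: Callable, arg, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(arg), retrying with exponential backoff on RuntimeError."""
    for attempt in range(attempts):
        try:
            return fn(arg)
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def process_report(lines: List[str], summarize: Callable, max_tokens: int) -> list:
    # Sequential on purpose: only one request is in flight at a time.
    return [call_with_retry(summarize, c) for c in chunk_by_budget(lines, max_tokens)]
```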

Long-Running Workload Handling

  • Introduced spawned subprocess execution for tasks exceeding 240 seconds.
  • Used multithreaded request handling to keep UI responsive under heavy load.
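One way the subprocess-plus-thread pattern might look, using only the standard library. The `BackgroundJob` class is a hypothetical illustration: the child process does the long-running work while a watcher thread records the outcome, so the caller can poll `status` instead of blocking.

```python
import subprocess
import sys
import threading
from typing import List, Optional


class BackgroundJob:
    """Runs a command in a child process and records its outcome."""

    def __init__(self, cmd: List[str]):
        self.status = "running"
        self.output = ""
        self._proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        # The watcher thread blocks on the child so the caller does not have to.
        self._watcher = threading.Thread(target=self._wait, daemon=True)
        self._watcher.start()

    def _wait(self) -> None:
        out, _ = self._proc.communicate()
        self.output = out
        self.status = "done" if self._proc.returncode == 0 else "failed"

    def join(self, timeout: Optional[float] = None) -> str:
        """Wait up to `timeout` seconds and return the current status."""
        self._watcher.join(timeout)
        return self.status


if __name__ == "__main__":
    job = BackgroundJob([sys.executable, "-c", "print('report ready')"])
    print(job.status)          # typically still "running" at this point
    print(job.join(timeout=30))
```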

Real-Time Progress Visibility

  • Created live status tracking UI for long-running report execution.
  • Removed user confusion around perceived hung tasks.

Cloud, Storage, and Reporting Automation

Azure Blob Locking, Leasing, and Caching

  • Applied blob leases and locks for safe parallel processing without race conditions.
  • Guaranteed single-writer behavior in concurrent execution scenarios.
  • Added cache-based SAS URL reuse for repeated tower-node-date report requests.
  • Reduced compute overhead and improved response time for repeated report access.
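The cache-and-single-writer idea can be sketched as follows. This is an in-process stand-in: a `threading.Lock` plays the role that a blob lease would play across processes, and `ReportUrlCache`, its `build` callback, and the placeholder URL are hypothetical.

```python
import threading
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]  # (tower, node, date)


class ReportUrlCache:
    """Reuses a generated report URL per (tower, node, date) until it expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[Key, Tuple[float, str]] = {}

    def get_or_create(self, key: Key, build: Callable[[Key], str]) -> str:
        # One writer at a time. Holding the lock during build() serializes all
        # keys, a simplification that a per-key lock or lease would avoid.
        with self._lock:
            now = self._clock()
            hit = self._entries.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            url = build(key)
            self._entries[key] = (now + self._ttl, url)
            return url
```

The TTL should stay shorter than the lifetime of the signed URL being cached, so a cached entry never outlives its own signature.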

End-to-End Automated Reporting

  • Built daily, weekly, and monthly report pipelines using LangGraph, Python, FPDF, and Azure Blob Storage.
  • Delivered leadership-ready, audit-oriented compliance and infrastructure health summaries.

Critical Issues Resolved

Performance, Stability, and Defect Management

  • Fixed timeouts on requests running longer than 240 seconds by moving the work to spawned subprocesses and background threads.
  • Resolved AI rate-limit and token overflow failures with chunking and controlled retries.
  • Eliminated duplicate PDF generation through idempotency and blob lease controls.
  • Validated full UI dialog flow through positive and negative scenario testing.
  • Stabilized RAG, Cosmos DB, and indexing workflows by correcting keys and throttling behavior.
  • Reduced merge-conflict regressions through DRY refactors and stronger GitHub Actions pipelines.

Core Tech Stack

LangChain, LangGraph, Deterministic JSON Schema Prompts, RAG Patterns, Python (async / threading / subprocess), Streamlit, Plotly, Seaborn, Matplotlib, Azure Blob Storage (Locking, Leasing, SAS), FPDF, GitHub Actions, Cybersecurity Controls

Additional Highlights

  • Led enterprise AI knowledge-sharing sessions on RAG and production engineering patterns.
  • Improved team coding standards through reusable architecture and defensive coding practices.
  • Mentored teammates on debugging workflows and practical AI agent adoption in operations.
  • Delivered scalable, secure, low-defect systems end-to-end from design to production support.