SRE-led RunOps optimize and secure cloud operations for a GenAI platform at a multinational corporation
Case Study

Leveraging Site Reliability Engineering (SRE) for GenAI platform administration while accelerating cloud adoption strategy

Challenge

A leading multinational corporation building a GenAI platform across Azure, AWS, GCP, and hybrid environments faced challenges managing a complex application with 200+ services deployed across multiple environments.

Key challenges included a lack of standardization in design, security, and deployment; significant observability gaps; scalability and automation limitations; and the need to coordinate global stakeholders. Initially designed on Azure, the platform was intended to be cloud-agnostic, deployable across public clouds or on-premises. The engagement also had to address 70+ cloud misconfigurations, 90+ critical CVEs (Common Vulnerabilities and Exposures), and gaps in observability and FinOps.

Solution

Hitachi Digital Services delivered a cloud-agnostic platform deployable on Azure, AWS, GCP, or hybrid setups.

  • Implemented RunOps with Terraform and OpenTelemetry for 150+ microservices.
  • Accelerated cloud adoption via a cloud acceleration program and introduced SRE-led operations for efficient management.
  • Enabled 360-degree observability with Azure Monitor, Log Analytics, and Elastic Stack.
  • Automated deployment artifacts, alert templates, and dashboards; enhanced FinOps reporting, strengthened security through automation, and implemented an ITSM tool.

Result

The client achieved measurable improvements in operational efficiency, scalability, and cost optimization.

  • 30% productivity improvement through streamlined operations and optimized workflows.
  • 35% reduction in Total Cost of Operations (TCO) through improved FinOps and automated resource management.
  • Reusable Assets: Over 50 reusable assets, including SOPs, runbooks, templates, and standards, ensuring consistent performance for Day 1 operations.
  • Improved Observability: Closed 70+ observability gaps and automated dashboards, significantly enhancing incident resolution.
  • Resolved 70+ cloud misconfigurations, 90+ critical CVEs, and 20% FinOps gaps.

Key benefits include:

30% productivity improvement

35% reduction in Total Cost of Operations (TCO)

Future-proofed platform scalability