SRE-led RunOps optimizes and secures cloud operations for a GenAI platform at a multinational corporation
A leading multinational corporation building a GenAI platform across Azure, AWS, GCP, and hybrid environments faced challenges managing a complex application with 200+ services deployed across multiple environments.
Key challenges included a lack of standardization in design, security, and deployment; significant observability gaps; scalability and automation limitations; and the coordination of global stakeholders. Initially designed on Azure, the platform needed to become cloud-agnostic so that it could be deployed on any public cloud or on-premises. The engagement also had to address 70+ cloud misconfigurations, 90+ critical CVEs (Common Vulnerabilities and Exposures), and gaps in observability and FinOps.
Hitachi Digital Services delivered a cloud-agnostic platform for Azure, AWS, GCP, or hybrid setups.
- Implemented RunOps with Terraform and OpenTelemetry for 150+ microservices.
- Accelerated cloud adoption via a cloud acceleration program and introduced SRE-led operations for efficient management.
- Enabled 360-degree observability with Azure Monitor, Log Analytics, and Elastic Stack.
- Automated deployment artifacts, alert templates, and dashboards. Enhanced FinOps reporting, strengthened security through automation, and implemented ITSM tooling.
The client achieved measurable improvements in operational efficiency, scalability, and cost optimization.
- 30% productivity improvement through streamlined operations and optimized workflows.
- 35% reduction in Total Cost of Operations (TCO) through improved FinOps and automated resource management.
- Reusable Assets: Over 50 reusable assets, including SOPs, runbooks, templates, and standards, ensuring consistent performance for Day 1 operations.
- Improved Observability: Closed 70+ observability gaps and automated dashboards, significantly enhancing incident resolution.
- Resolved 70+ cloud misconfigurations, 90+ critical CVEs, and 20% of FinOps gaps.
30% productivity improvement
35% reduction in Total Cost of Operations (TCO)
Future-proofed platform scalability