(Almost) Every infrastructure decision I endorse or regret after 4 years running infrastructure at a startup · Jack's home on the web

The Nugget

  • Endorsement and regret, balanced by experience: Choosing AWS, running Kubernetes on EKS, and relying on managed services such as RDS and Redis ElastiCache are endorsed as pivotal infrastructure decisions, while EKS managed addons and Datadog's cost model are sources of regret.
  • These reflections come from four years of running infrastructure at a startup, so each choice is judged by how it held up in practice for scalability and efficiency.

Key insights

Infrastructure Choices: AWS and Kubernetes

  • Choosing AWS: AWS was preferred over Google Cloud for its stronger support and customer focus. Its stability and integrations reinforced the decision, despite initial uncertainty around Kubernetes.
  • Adopting EKS: Running Kubernetes on EKS is endorsed for balancing functionality against ease of maintenance; the time and effort it saves outweigh reservations about cost.

Managed Services: RDS and Redis

  • Data Management with RDS: Because data integrity and disaster recovery are critical, RDS is treated as a non-negotiable choice that is worth its higher cost.
  • Redis ElastiCache: The adoption of Redis ElastiCache was praised for its performance and flexibility, underlining its role as a versatile data handling tool.

Reflections on Tools and Approaches

  • Mixed feelings on EKS managed addons and Datadog: The move away from EKS managed addons toward Helm charts, and the financial burden of Datadog, highlight the importance of customization and cost-effectiveness.
  • Embracing GitOps and Flux for Kubernetes: GitOps has been validated as an effective strategy for managing infrastructure, with Flux particularly noted for its successful application in a Kubernetes environment.
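
The GitOps idea in the last bullet can be pictured as a reconcile step: Git holds the desired state, and a controller compares it with what is actually running. The Python sketch below is only a conceptual illustration under that assumption. It is not how Flux is implemented, and `plan_reconcile`, `Action`, and the sample manifests are hypothetical names made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str  # "create", "update", or "delete"
    name: str  # hypothetical resource name


def plan_reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[Action]:
    """Compare desired state (as committed to Git) with actual state (as
    observed in the cluster) and list the changes needed to converge."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    # Anything running that Git no longer describes gets removed.
    for name in sorted(actual):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


if __name__ == "__main__":
    # Hypothetical manifests, reduced to plain strings for the sketch.
    desired = {"api": "image: api:v2", "worker": "image: worker:v1"}
    actual = {"api": "image: api:v1", "cron": "image: cron:v1"}
    for action in plan_reconcile(desired, actual):
        print(action.verb, action.name)
```

Because every change arrives as a commit, the plan is always derivable from the repository, which is where the visibility and control described above come from.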

Process and Efficiency

  • Streamlining with Automation: Automating processes, such as the post-mortem process with a Slack bot, proved essential in maintaining efficiency and accountability within the operational framework.
  • Alerting and Incident Management: The development of a structured alerting framework and the use of PagerDuty templates have been key to managing incidents effectively.
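
As a rough illustration of what automating the post-mortem process might involve, the Python sketch below builds reminder messages for incidents whose write-up is still missing. The summary does not describe how the actual Slack bot works, so everything here is an assumption: the `Incident` fields, the `overdue_reminders` function, and the three-day grace period are hypothetical, and sending the messages to Slack is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    title: str
    owner: str  # hypothetical: chat handle of the person responsible
    resolved_on: date
    postmortem_done: bool


def overdue_reminders(
    incidents: list[Incident], today: date, grace_days: int = 3
) -> list[str]:
    """Return one reminder per incident whose post-mortem is still missing
    more than `grace_days` after the incident was resolved."""
    reminders = []
    for incident in incidents:
        if incident.postmortem_done:
            continue
        age = (today - incident.resolved_on).days
        if age > grace_days:
            overdue = age - grace_days
            reminders.append(
                f"@{incident.owner}: post-mortem for '{incident.title}' "
                f"is {overdue} day(s) overdue"
            )
    return reminders


if __name__ == "__main__":
    incidents = [
        Incident("DB failover", "alice", date(2024, 1, 2), False),
        Incident("Cache eviction storm", "bob", date(2024, 1, 9), False),
        Incident("Expired certificate", "carol", date(2024, 1, 1), True),
    ]
    for message in overdue_reminders(incidents, today=date(2024, 1, 10)):
        print(message)
```

A scheduled job could run a check like this once a day and post each message to the relevant channel, which keeps accountability from depending on someone remembering to follow up.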

Key quotes

  • "Data is the most critical part of your infrastructure. You lose your network: that's downtime. You lose your data: that's a company ending event."
  • "any system with a lot of ways to use has a lot of ways to use wrong."
  • "Nobody runs a service at 100% CPU utilization and moves on with their life."
  • "Just lean into an identity solution early on and only accept SaaS vendors that integrate with it."

Make it stick

  • 🛠 AWS over GCP: Personal touch with AWS provides unmatched support and guidance.
  • 🚀 EKS over hand-rolled Kubernetes: Managed Kubernetes services save time and shield teams from the complexities of cluster management.
  • 🎒 Choosing managed services like RDS and Redis: Favor managed services for critical operations to ensure reliability and disaster recovery.
  • 📚 GitOps and Flux: Embrace GitOps for infrastructure management to enhance visibility and control over deployment processes.