Reduce The Costs Of Training & Running Cloud-Based ML Models
We all wish we had infinite budgets to play around with AI. 💸 Raising money isn't easy, and finance teams are always looking for ways to cut expenses.
But just because training and running ML models can cost a 💰 fortune doesn't mean it's out of your reach.
In fact, there are a few things you can do to lower your cloud bill and boost your resource utilization to get an edge over those who don’t subscribe to Kubernetes IRL.
1. Pick the right infrastructure for running your ML project
If you’re planning an ML project, this is the best time to compare different cloud providers and their pricing models for compute and managed AI services (AWS SageMaker, Google Cloud Vertex AI, and Azure ML Studio).
Take your time to explore various instance families – In AWS, you’re looking at P3, G3, G4, and G5. Google offers A2, G2, and N1. Azure has a big range of machines from the NCv3-series to the freshly released NDm A100 v4-series.
Pick the right instance for the job – Analyze your workload requirements and the needs of your specific AI models across all compute dimensions, including CPU (vs. GPU), memory, SSD storage, and network connectivity.
Keep up with novelties – Cloud providers are constantly launching new instance types for inference, such as Amazon EC2 Inf2, powered by AWS Inferentia2, the second-generation AWS Inferentia accelerator. According to AWS, Inf2 delivers 3x higher compute performance, 4x larger total accelerator memory, and up to 4x higher throughput than Inf1.
Use spot instances for non-critical tasks – Take advantage of the incredible savings spot instances/preemptible VMs offer for non-critical tasks that are interruption-tolerant, like model development or experimentation. Running batch jobs at an 80-90% discount just makes sense (see the sketch after the tip below).
ℹ️ Tip: Some automation solutions use spot interruption prediction models to predict upcoming reclaims and move compute to other instances before the interruption happens.
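To make this concrete: on AWS, for example, you can request Spot capacity directly when launching an instance. Here's a minimal boto3 sketch – the region, AMI ID, and instance type are placeholders you'd swap for your own:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical deep learning AMI
    InstanceType="g4dn.xlarge",       # example GPU instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # One-time Spot request that simply terminates on reclaim --
            # fine for interruption-tolerant training or experimentation jobs
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)

print(response["Instances"][0]["InstanceId"])
```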
2. Optimize resource utilization
Whether you're working on ML or running an e-commerce application, overprovisioning is a common issue in Kubernetes. It gets even more wasteful when you overprovision GPU instances, which cost multiple times as much as CPU-powered VMs.
Here are a few things you can do to make smarter use of the resources you pay for:
Monitor resource consumption using tools like Grafana or CAST AI – they give you access to real-time consumption data across the ML lifecycle for full transparency. Workload efficiency metrics instantly tell you whether a workload is really using the capacity you set for it (see the sketches after this list).
Use autoscaling – Kubernetes has three autoscaling mechanisms available (the Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler), and you can use third-party tools that improve horizontal and vertical scaling significantly. A minimal example follows after this list.
Implement bin packing – there's also tooling available to modify the behavior of the Kubernetes scheduler so that it maximizes node utilization, packing as many pods onto each node as possible instead of running multiple underutilized nodes.
Remove idle resources – what’s the point of paying for cloud resources when you’re not running any experiments? Don’t leave these machines hanging idle. You can use an automation solution to remove them as soon as there are no jobs to run.
Implement cost-saving policies and guidelines within your organization – To control cloud costs, you need all hands on deck. Making engineers aware of how much resources cost is the greatest challenge of any FinOps effort, not just of shipping efficient AI models.
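If you want a quick, tool-agnostic look at whether workloads actually use what they request, you can pull the same data yourself. Here's a rough sketch using the official Kubernetes Python client and the metrics API – the namespace is a placeholder, and it assumes the metrics server is running in your cluster:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
metrics_api = client.CustomObjectsApi()

namespace = "ml-jobs"  # hypothetical namespace

# Actual CPU usage as reported by the metrics server
usage_by_pod = {
    item["metadata"]["name"]: item["containers"]
    for item in metrics_api.list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods"
    )["items"]
}

# Compare it against what each container requested
for pod in core.list_namespaced_pod(namespace).items:
    for container in pod.spec.containers:
        requested_cpu = (container.resources.requests or {}).get("cpu", "none")
        used_cpu = next(
            (c["usage"]["cpu"] for c in usage_by_pod.get(pod.metadata.name, [])
             if c["name"] == container.name),
            "n/a",
        )
        print(f"{pod.metadata.name}/{container.name}: requested={requested_cpu}, used={used_cpu}")
```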
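And as a small example of the autoscaling tip, here's a Horizontal Pod Autoscaler created with the official Kubernetes Python client – the Deployment name, namespace, and the 70% CPU target are purely illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

# Illustrative names -- point this at your own Deployment and namespace
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-hpa", namespace="ml-serving"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-api"
        ),
        min_replicas=1,
        max_replicas=10,
        # Add replicas when average CPU utilization exceeds 70%
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml-serving", body=hpa
)
```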
3. Optimize data storage and transfer
Another smart move is looking at your data storage and transfer fees. If you move data from one availability zone to another, you might end up paying a pretty high egress fee on every move.
How do you reduce the costs of AI model training? Here are a few tips:
Compress data – Do it before storing or transferring it. The Parquet file format is very useful when training more traditional ML models on tabular data. By switching from CSV to Parquet, you can cut storage by roughly 10x and gain quite a few performance benefits thanks to Parquet's columnar format (there's a quick sketch after this list).
Cache intermediate results – This is how you can minimize data transfer fees and expedite AI model inference. Vector databases are a key innovation here – they have changed how we work with the ever-growing amounts of unstructured data sitting in object storage. This method is use-case specific but works really well for LLM-based applications.
Implement data lifecycle policies – This opens the door to automatically managing storage costs, which is a particular gain for large AI models. Example: a data retention policy specifying when a team can remove data and, if it must be kept, where it should live after a set period (for instance, moved to cold storage). A sketch of such a policy follows below.
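To make the CSV-to-Parquet switch concrete, here's a minimal pandas sketch (it assumes pyarrow is installed; the file and column names are hypothetical, and the exact savings depend on your data):

```python
import pandas as pd

# Hypothetical tabular training data
df = pd.read_csv("training_data.csv")

# Columnar storage plus snappy compression typically shrinks tabular data
# dramatically compared to plain CSV and speeds up selective reads
df.to_parquet("training_data.parquet", compression="snappy")

# Later, load only the columns a training job actually needs
features = pd.read_parquet("training_data.parquet", columns=["feature_a", "label"])
```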
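And here's a rough sketch of such a lifecycle policy on AWS, using boto3 to move objects under a hypothetical datasets/ prefix to cold storage after 90 days and delete them after a year – the bucket name, prefix, and time windows are all placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-artifacts",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-training-data",
                "Filter": {"Prefix": "datasets/"},
                "Status": "Enabled",
                # Move objects to cold storage after 90 days...
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # ...and remove them entirely after a year
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```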
I hope these tips help you save on running ML in the cloud.
Cheers,
Allen
Found this email useful? Forward it to your friends and colleagues who need more Kubernetes best practices in their lives!
CAST AI Group Inc., 111 NE 1st Street, 8th Floor #1041, Miami, Florida 33132, United States