When models get large or compute times get slow, the pattern I've seen most is actually two servers with a private API passing data between them: a robust backend server with the model & ML environment installed, and a separate front-end that is much lighter. Get inputs from the user, pass them as JSON to the backend, run the prediction function, and pass the results back to the front end (as JSON again). When I did mine 3 years ago, I used simple S3 for storage and EC2 for compute, with each server scaled appropriately. There are probably other ways now, but at the end of the day, all of computing consists of two things: storage & compute, which is exactly what S3 & EC2 are. The most popular providers are Amazon (AWS), Microsoft (Azure), and Google (GCP). All of them have free trials / credits to get started.
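The front-end/backend split above can be sketched with nothing but the standard library. This is a minimal illustration of the JSON hand-off, not a real network setup; `predict` is a hypothetical stand-in for the actual model:

```python
import json

# Backend side: hypothetical "model" standing in for the real ML model.
def predict(features):
    # Trivial placeholder: real code would call model.predict(...) here.
    return {"score": sum(features)}

# Front-end side: collect user inputs and serialize them as JSON.
def build_request(features):
    return json.dumps({"features": features})

# Backend side: parse the JSON request, run the model, return JSON results.
def handle_request(payload):
    data = json.loads(payload)
    return json.dumps(predict(data["features"]))

# Round trip: front end -> backend -> front end
payload = build_request([1.0, 2.0, 3.0])
response = json.loads(handle_request(payload))
print(response["score"])  # → 6.0
```

In a real deployment the two sides run on separate servers and the JSON travels over a private HTTP API instead of a function call, but the contract is the same.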


I'm also looking for an alternative; if anyone knows anything, I'll be grateful.


AWS Lambda? Not 100% sure.


I've never gotten Heroku to work, and they never gave me usable error messages. What I prefer is HuggingFace+Gradio. You can upload large files to it through Git LFS. I'm not sure what their storage limits are, though.


If I remember correctly, Heroku only applies Dyno size limits when we are talking about a direct deployment using a Procfile. If you are using a Docker image, however, there don't seem to be any explicit Dyno size limits involved.


Yeah, that's correct. I recently deployed a TensorFlow image to Heroku that was more than a GB.


Hey, Heroku is good for POCs, so in general I would start by looking at the AWS free tier and storing your model in S3, then loading it up in FastAPI deployed to A) Lambda or B) an EC2 instance and serving it that way. Under the hood, Heroku is using AWS instances anyway. Another option is to convert your TF model to PyTorch and reserialize it with PyTorch Lightning: https://www.pytorchlightning.ai