Yeah it is pretty nice. Not sure how long it took, but less than the time to make a sandwich (~2 minutes). It costs 2-3c a pop, so sadly more expensive than GPT-3.5. However, maybe it can be optimised, or maybe there is some init cost that could be stored in state.
(modal) fme:/mnt/c/temp/modal$ modal run openllama.py
✓ Initialized. View app at https://modal.com/apps/ap-9...
✓ Created objects.
├── 🔨 Created download_models.
├── 🔨 Created mount /mnt/c/temp/modal/openllama.py
├── 🔨 Created OpenLlamaModel.generate.
└── 🔨 Created mount /mnt/c/temp/modal/openllama.py
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 1733.54it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.23s/it]
Building a website can be done in 10 simple steps:
1. Choose a domain name. 2. Choose a web hosting service. 3. Choose a web hosting package. 4. Choose a web hosting plan. 5. Choose a web hosting package. 6. Choose a web hosting plan. 7. Choose a web hosting package. 8. Choose a web hosting plan. 9. Choose a web hosting package. 10. Choose a web hosting plan. 11. Choose a web hosting package. 12. Choose a web hosting package. 13. Choose a web hosting package. 14. Choose a web hosting
✓ App completed.
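On the "init cost stored in state" point: the idea is to pay the checkpoint-load cost once per container and reuse the loaded model across calls, rather than reloading per request. A rough, framework-agnostic sketch of that pattern (class and method names are hypothetical placeholders, not the actual example code):

```python
class OpenLlamaModel:
    """Sketch of the load-once, generate-many pattern.

    In a serverless setup this load would happen in the container's
    startup/enter hook; here it's plain Python to show the idea.
    """

    _model = None  # cached once per container/process, then reused

    def _load(self):
        # stand-in for downloading and loading the checkpoint shards
        return {"name": "open_llama_7b"}  # hypothetical placeholder

    def generate(self, prompt: str) -> str:
        if OpenLlamaModel._model is None:  # pay the init cost only once
            OpenLlamaModel._model = self._load()
        return prompt + " ..."


m = OpenLlamaModel()
m.generate("warm-up")  # first call pays the load cost
out = m.generate("Building a website")  # later calls reuse the cached model
```

So the 2-3c figure likely mixes the one-off load time into the per-call cost; warm calls against a live container should be cheaper.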
2-3c per run seems very high. That's probably just the cost if you have to spin up a new container. You can shorten the idle timeout on a container if it's typically going to serve just one request. If it's going to serve more requests, then the startup and idle-shutdown cost is amortized over more requests :)
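The amortization works out like this (illustrative numbers, not measured costs):

```python
def cost_per_request(startup_cost: float, per_request: float, n_requests: int) -> float:
    """Total container cost spread over the requests that container serves."""
    return (startup_cost + per_request * n_requests) / n_requests


# e.g. 2c to cold-start the container, 0.5c of GPU time per request (made-up numbers)
print(cost_per_request(2.0, 0.5, 1))   # a single request eats the whole startup cost
print(cost_per_request(2.0, 0.5, 20))  # startup spread over 20 requests is much cheaper
```

The first case comes to 2.5c per request, the second to 0.6c, which is why keeping a container around for more requests brings the per-call cost down.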
I found this was the cost per call to a web function. I used `deploy` to deploy it. The function just does what the main did in the example repo (earlier in this thread):
https://github.com/modal-labs/modal-examples/blob/main/06_gp...