Meta released Llama2 as an open model: the inference code is available in a public GitHub project, and the trained weights were also released for free, so anyone can download and use them as they see fit, including hosting and deploying them. However, deploying the model requires one or more GPUs, which can be quite expensive.
Several cloud providers have already integrated the Llama2 model into their deployment options. For instance, AWS offers deployment through SageMaker, and HuggingFace supports it as well; we will discuss both in this section. Local deployment or inference is also feasible, as mentioned earlier, but it does require one or more GPUs. In another section, we will delve into how to use Llama2 locally with Python, and even explore techniques for using it without GPUs.
AWS SageMaker
SageMaker already integrates Llama2 as a model in the Studio environment, so it can be deployed as an endpoint with only a few clicks. To do that:
1. Go to the “Studio” section in SageMaker
2. In the “Get Started” box, select a domain and click on “Open Studio”
3. Select a Llama2 model
a. In the Studio environment, open the “Models, notebooks, solutions” section in the sidebar, under the “JumpStart” section
b. Select the desired Llama-2 model. The Llama-2-7b-chat model is the recommended starting choice.
4. Click on the “deploy” option
5. Once the model is deployed, go to the “Endpoints” section in the sidebar, under “Deployments”. You should see the deployed endpoint there with an “In service” status
Be aware that the models are deployed on GPU instances, so depending on your AWS quotas the required instance types may or may not be available. In general there is no problem deploying the 7b-parameter model, since it only needs a single-GPU instance, but the larger 13b and 70b models may require bigger instances.
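If you prefer to script the deployment instead of using the Studio UI, the SageMaker Python SDK exposes the same JumpStart models. The following is only a rough sketch: the model_id and the exact instance type picked by the SDK are assumptions that may vary across SDK versions and regions.

# Minimal sketch: deploying Llama2 through the SageMaker Python SDK (JumpStart).
# Assumes AWS credentials are configured and the account has quota for a GPU instance.
# The model_id below is an assumption and may differ between SDK versions.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f")  # 7b chat variant

# accept_eula=True acknowledges Meta's license terms for the model
predictor = model.deploy(accept_eula=True)

print(predictor.endpoint_name)  # note this name, it is needed to invoke the endpoint later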
Outside SageMaker Studio, you can invoke the endpoint using the boto3 invoke_endpoint method.
- The following Python example assumes that you are already authenticated in AWS through the AWS CLI. You need the endpoint name to invoke it:
CODE: https://gist.github.com/santit96/a8fe644b6ba5e7805d3f7b0a45ede38d.js?file=invoke_llama2_endpoint_aws.py
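In case the gist is not accessible, the call looks roughly like the sketch below. The endpoint name and the payload schema are assumptions based on the JumpStart Llama2 chat format and may need adjusting to your deployment.

# Rough sketch: invoking a deployed Llama2 endpoint with boto3.
# Assumes AWS credentials are already configured (e.g. via `aws configure`).
# ENDPOINT_NAME is hypothetical; replace it with your endpoint's name.
import json
import boto3

ENDPOINT_NAME = "jumpstart-dft-meta-textgeneration-llama-2-7b-f"

client = boto3.client("sagemaker-runtime")

payload = {
    "inputs": [[{"role": "user", "content": "What is Amazon SageMaker?"}]],
    "parameters": {"max_new_tokens": 256, "temperature": 0.6, "top_p": 0.9},
}

response = client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
    CustomAttributes="accept_eula=true",  # JumpStart Llama2 endpoints require accepting the EULA
)

print(json.loads(response["Body"].read()))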
HuggingFace
Llama2 is also available on HuggingFace, and it is possible to deploy an inference endpoint that serves it. To do that, you first have to request access to the model from Meta, which usually takes about a day to be granted. You or your organization also have to add a valid payment method to the HuggingFace account. Then:
- Go to the model card and select “deploy inference endpoint”
- Wait around 5 minutes
- Invoke the endpoint at the provided URL:
CODE: https://gist.github.com/santit96/a8fe644b6ba5e7805d3f7b0a45ede38d.js?file=invoke_llama2_endpoint_huggingface.sh
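The same request can also be made from Python. The sketch below assumes a text-generation endpoint; the URL and access token are placeholders that you copy from the endpoint page.

# Rough sketch: calling a HuggingFace Inference Endpoint that serves Llama2.
# ENDPOINT_URL and HF_TOKEN are placeholders; copy yours from the endpoint page.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."  # an access token with permission to call the endpoint

headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}

payload = {
    "inputs": "Explain what Llama2 is in one paragraph.",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7},
}

response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())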
Costs
Llama2 itself is free to use; however, the infrastructure needed to deploy and run the model is not. Deploying Llama2 on serverless inference in AWS or another platform to use it on demand could be a cost-effective alternative, potentially more affordable than using the GPT API. Unfortunately, GPU serverless inference is not available, and the containers for CPU serverless inference are limited to 6GB of RAM. In contrast, Llama2 quantized to 8 bits requires at least 10GB, and quantized to 4 bits it still uses around 6.2GB.
Without serverless inference, Llama2 can only be used in production on an always-running instance, which could be a HuggingFace or AWS endpoint, an EC2 instance, or an Azure instance. On average, these instances cost around $1.50 per hour, which adds up to roughly $1,000 per month if the endpoint runs continuously.
Taking all this information into account, it becomes evident that GPT is still a more cost-effective choice for large-scale production tasks. It offers quick responses with minimal effort by simply calling an API, and its pricing is quite competitive.
It is possible to run quantized Llama2 on a CPU with limited resources, as we will explore later, but response times are considerably high in most cases. While this is acceptable for testing, it may not meet the real-time response needs of production applications.
Conclusion
Deploying the Llama2 model for text generation can be done through various options, including AWS SageMaker and HuggingFace. These platforms provide convenient ways to deploy and use the model as an endpoint. However, it's important to note that deploying the model requires one or more GPUs, which can be costly.
When considering the costs involved, it's essential to account for the infrastructure expenses of running and deploying the model. While Llama2 itself is available for free, the infrastructure required to support it incurs costs, especially when using GPU instances.
In conclusion, deploying Llama2 for text generation offers flexibility and accessibility through platforms like AWS SageMaker and HuggingFace. However, it's vital to carefully weigh the associated costs and compare them against the GPT API, which may still be the more convenient option due to the robustness of the tool and the affordability of its pricing.