10+ Ways To Run Open-Source Models With LlamaIndex - by Wenqi Glantz
https://levelup.gitconnected.com/10-ways-to-run-open-source-models-with-llamaindex-84fd4b45d0cf 1/34
12/21/23, 9:28 AM 10+ Ways to Run Open-Source Models with LlamaIndex | by Wenqi Glantz | Dec, 2023 | Level Up Coding
Image by DALL-E 3
The CTO of Hugging Face, Julien Chaumond, posted on LinkedIn two weeks ago and
predicted that “Most open source models next year will be better than OpenAI’s”.
For each integration method, we show a sample code snippet with a different LLM supported by that particular method.
We will implement the RAG pipeline with SentenceWindowNodeParser , a tool that can
be used to create representations of sentences that consider the surrounding words
and sentences. During retrieval, before passing the retrieved sentences to the LLM,
the single sentences are replaced with a window containing the surrounding
sentences using the MetadataReplacementNodePostProcessor . This is most useful for
large documents, as it helps to retrieve more fine-grained details.
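The mechanism can be illustrated with a small, stdlib-only sketch. This is not LlamaIndex's implementation, just the idea: each sentence node carries its surrounding window in metadata, and at query time the matched sentence is swapped for that window before the LLM sees it.

```python
# Illustrative sketch of sentence-window retrieval (not LlamaIndex's code).

def build_nodes(sentences, window_size=3):
    """Attach a window of +/- window_size surrounding sentences to each sentence."""
    nodes = []
    for i, s in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({"original_text": s, "window": " ".join(sentences[lo:hi])})
    return nodes

def replace_with_window(node):
    """Mimics what MetadataReplacementPostProcessor does: return the window."""
    return node["window"]

sentences = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
nodes = build_nodes(sentences, window_size=1)
# Retrieval matched the single sentence "S3.", but the LLM sees its window:
print(replace_with_window(nodes[3]))  # S2. S3. S4.
```

The retriever still matches on the fine-grained single sentence, which is why this helps with large documents.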
Let’s first lay out our RAG pipeline’s five steps; see the code snippet below. ServiceContext , the wrapper class for the LLM, embedding model, and other hyperparameters, will be a placeholder for now; we will implement it in various ways in the following sections.
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(pages=["It's a Wonderful Life"], auto_suggest=False)
print(f'Loaded {len(documents)} documents')
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=3,
window_metadata_key="window",
original_text_metadata_key="original_text",
)
simple_node_parser = SimpleNodeParser.from_defaults()
service_context = ServiceContext.from_defaults(
llm=llm,
embed_model=embed_model
)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes, service_context=service_context)
query_engine = index.as_query_engine(
similarity_top_k=2,
# the target key defaults to `window` to match the node_parser's default
node_postprocessors=[
MetadataReplacementPostProcessor(target_metadata_key="window")
]
)
Now that we have our RAG pipeline laid out, let’s dive into the different methods to
call the open-source models to fill in Step 3 above, which defines the llm and
embed_model and constructs service_context .
Hugging Face provides the transformers package to enable access to its open-source
models. We first install transformers[torch] package:
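The install command (reconstructed from the package name given above):

```shell
pip install "transformers[torch]"
```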
Now, let’s provide our Hugging Face token. Note that using your token will not
charge you money. LlamaIndex wraps the transformers[torch] package into LLM
entities by its class HuggingFaceLLM for LLMs and HuggingFaceEmbedding for the
embedding models. See the code snippet below.
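A sketch of how these two classes are typically constructed. The model names are the ones used elsewhere in this article ( HuggingFaceH4/zephyr-7b-beta and BAAI/bge-base-en-v1.5 ); the remaining parameters are illustrative assumptions, so adjust them to your hardware.

```python
from llama_index.llms import HuggingFaceLLM
from llama_index.embeddings import HuggingFaceEmbedding

# Local LLM wrapped by LlamaIndex; the model is downloaded on first use
llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,   # leave headroom for the completion
    max_new_tokens=256,
    device_map="auto",     # spread layers across available GPUs/CPU
)

# Local embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
```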
While executing the above step, we can see a progress bar for the model download. The batches download sequentially; note this, as we will compare it with how vLLM downloads models in parallel.
With the above definition for llm and embed_model , we get the following output
when asking, “Why did Harry say George is the richest man in town?”:
Please note the out-of-the-box Zephyr query engine pack runs HuggingFaceH4/zephyr-
7b-beta as its LLM, and BAAI/bge-base-en-v1.5 as its embedding model. You are
welcome to customize the embedding model.
For those new to LlamaPacks, check out my previous article to understand how
LlamaPacks work and how to customize a particular pack.
Inference API comes with two plans: the pro plan and the enterprise plan. Use the free pro plan with shared infrastructure, which is great for POCs or development tasks. For production use, switch to the enterprise plan with dedicated inference endpoints; custom pricing is based on volume commitment, starting at $2k/month with annual contracts.
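The llm side of this method can be sketched with the HuggingFaceInferenceAPI class; the model name here is an assumption, and the token mirrors the one used for the embedding model below.

```python
from llama_index.llms import HuggingFaceInferenceAPI

HF_TOKEN = "hf_..."  # your Hugging Face access token

# Remote LLM served by Hugging Face's Inference API
llm = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-beta", token=HF_TOKEN
)
```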
embed_model = HuggingFaceInferenceAPIEmbedding(
model_name="WhereIsAI/UAE-Large-V1", token=HF_TOKEN
)
With the above definition for llm and embed_model , we get the following output
when asking, “Why did Harry say George is the richest man in town?”:
Note that the Inference API is free; your token is only used for rate limiting. Per Hugging Face API Inference FAQ:
The free Inference API may be rate limited for heavy use cases. We try to balance the loads
evenly between all our available resources, and favoring steady flows of requests. If your
account suddenly sends 10k requests then you’re likely to receive 503 errors saying models
are loading. In order to prevent that, you should instead try to start running queries
smoothly from 0 to 10k over the course of a few minutes.
Based on the above statement, it’s fair to say that the rate limiting is not much of a
concern for developers working on their POCs or experiments. However, while
working on this POC in Colab, I ran into the rate limiting error for both the LLM and
the embedding model multiple times, which surprised me as my usage was in the
“normal usage” scope. Unfortunately, the details of the rate limiting are unclear, and I
couldn’t find further information on Hugging Face’s website. My conclusion is that
the rate limiting of the inference API could be a potential issue for developers in
their day-to-day development activities, although the rate limiting should not pose
any obstacle for a quick POC.
Method 3: TextEmbeddingsInference
We explored the Hugging Face text-embeddings-inference server a few months ago.
Details can be found in my article Optimizing Text Embeddings with HuggingFace’s
text-embeddings-inference Server and LlamaIndex.
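As a quick refresher, the server can be run locally with Docker. The image tag below is an assumption; check the text-embeddings-inference repository for the tag matching your hardware (CPU vs GPU):

```shell
# Serve WhereIsAI/UAE-Large-V1 on port 8080 (CPU image; tag is an assumption)
docker run -p 8080:80 --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id WhereIsAI/UAE-Large-V1
```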
embed_model = TextEmbeddingsInference(
model_name="WhereIsAI/UAE-Large-V1",
    base_url="http://127.0.0.1:8080",  # change if the inference server is hosted elsewhere
timeout=60, # timeout in seconds
embed_batch_size=10, # batch size for embedding
)
Note that the context-window limit of UAE-Large-V1 , shared by many other BERT-based models, poses a limitation on the chunk sizes. This can potentially impact use cases dealing with large datasets, where larger chunk sizes can significantly improve parsing performance.
Method 4: Vllm
Source: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
vLLM can be used for both offline inference and online serving.
llm = Vllm(
model="HuggingFaceH4/zephyr-7b-beta",
dtype="float16",
tensor_parallel_size=4,
temperature=0,
max_new_tokens=100,
vllm_kwargs={
"swap_space": 1,
"gpu_memory_utilization": 0.5,
"max_model_len": 4096,
},
)
During execution, the difference between how Vllm handles the model download
and how HuggingFaceLLM handles the model download is easily spotted. Vllm can
download the model in parallel; see the download progress bar below.
Vllm also comes with a list of constructor parameters to allow for customization.
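For completeness, a hedged sketch of starting that API server locally; vLLM's api_server entrypoint serves a /generate endpoint on port 8000 by default.

```shell
# Start vLLM's online-serving API server (listens on port 8000)
python -m vllm.entrypoints.api_server \
  --model HuggingFaceH4/zephyr-7b-beta
```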
LlamaIndex also offers the VllmServer class for the integration with online serving.
Once your API server is up and running, you can define your llm by constructing
the VllmServer class, passing in the api_url . Modify the api_url accordingly if the
vLLM server is hosted in the cloud. See the sample code snippet below.
llm = VllmServer(
api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
)
The VllmServer also comes with a list of constructor parameters you can customize.
Compare Hugging Face Inference API Enterprise Plan with vLLM Online Serving
Both Hugging Face Inference API Enterprise Plan and vLLM Online Serving are
options for deploying your LLM in production. Let’s compare them to find out their
pros and cons.
Pros:
Cons:
Pros:
Potentially lower cost for specific use cases due to pay-as-you-go pricing.
Cons:
Recommendation:
Choose Hugging Face Inference API Enterprise Plan: If you primarily use
Hugging Face models and value ease of use, managed infrastructure, and robust
security.
Choose vLLM Online Serving: If you have diverse model formats, require fine-
grained control over resources, or cost efficiency is a major concern.
Method 5: Ollama
Ollama supports a list of open-source models available on ollama.ai/library, listed in
the table below.
Source: https://github.com/jmorganca/ollama
First, set up and run a local Ollama instance by following Ollama readme.
For Windows CPU, I executed the following commands to get the Ollama server up
and running locally, per instructions from Ollama’s docker hub page:
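The commands, as given on Ollama's Docker Hub page, are:

```shell
# Start the Ollama server container (CPU-only)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull and run the llama2 model inside the container
docker exec -it ollama ollama run llama2
```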
For the embedding model, since Ollama doesn’t support UAE-Large-V1 , we define the embed_model so it downloads UAE-Large-V1 locally.
# define llm
llm = Ollama(model="llama2")
response = ollama_pack.run("Why did Harry say George is the richest man in town?")
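Putting the two halves together, here is a sketch of the full ServiceContext for this method; HuggingFaceEmbedding downloads UAE-Large-V1 locally, as described above.

```python
from llama_index import ServiceContext
from llama_index.llms import Ollama
from llama_index.embeddings import HuggingFaceEmbedding

# LLM served by the local Ollama instance
llm = Ollama(model="llama2")

# Ollama doesn't serve UAE-Large-V1, so download and run it locally instead
embed_model = HuggingFaceEmbedding(model_name="WhereIsAI/UAE-Large-V1")

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
```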
Method 6: LlamaCPP
Llama.cpp is an open-source C/C++ library that implements Facebook’s Llama and
other related models like Llama2, Falcon, Alpaca, and GPT4All. It can run on CPUs
and GPUs, making it suitable for various deployment environments.
LlamaIndex offers the LlamaCPP class for integration with the llama-cpp-python
library. First we need to install this library:
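The install command:

```shell
pip install llama-cpp-python
```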
We define llm by passing in the model_url . See the sample code snippet below. Note, during execution, this step downloads the specified model Llama-2-13B-chat locally; it takes a bit of time to download this model, which is over 7GB.
# By default, if model_path and model_url are blank, the LlamaCPP module will load llama2-chat-13B
llm = LlamaCPP(
    # filename completed from the Llama-2-13B-chat-GGUF repo; pick the quantization you need
    model_url="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf",
)
With Llama-2-13B-chat as our LLM, asking our original question gives the following output:
The LlamaCPP class comes with a list of constructor parameters you can customize
per your need.
For more info on this integration and sample code snippets, refer to LlamaIndex’s
documentation on LlamaCPP.
Method 7: liteLLM
liteLLM is an open-source Python library that acts as a unified interface for calling a
variety of LLMs through a single, OpenAI-compatible API. This means that you can
use liteLLM to interact with over 100 different LLMs from providers listed below:
One of the key benefits of liteLLM is its simplicity. With liteLLM, you can use the
same input/output format to call any of the supported LLMs, regardless of their
underlying provider or API. This can save you a lot of time and effort, especially if
you’re working with multiple LLMs in your project.
The cost of using liteLLM depends on the LLM that you’re calling. Each LLM has its
own pricing model, and liteLLM simply passes through the costs that it incurs from
the provider. However, liteLLM itself is free to use.
liteLLM offers the proxy server option. The benefits of a proxy server include:
Load Balancing: between multiple models and deployments of the same model.
liteLLM proxy can handle 1.5k+ requests/second during load tests.
Now, in a Jupyter notebook (this cannot run from Google Colab, as it needs to access the locally hosted proxy server), let’s install the litellm library:
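The install command, plus a sketch of starting the proxy for a Hugging Face model; the CLI flag follows litellm's docs at the time of writing, so verify it against the version you install:

```shell
pip install litellm
# Start the liteLLM proxy server on http://0.0.0.0:8000 for a Hugging Face model
litellm --model huggingface/meta-llama/Llama-2-7b
```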
LlamaIndex offers LiteLLM class for this integration. See below a sample code
snippet:
llm = LiteLLM(
    model="huggingface/meta-llama/Llama-2-7b",
    api_key="hf_##################",  # HF access token
    api_base="http://0.0.0.0:8000",   # liteLLM proxy server URL
)
Method 8: Replicate
Replicate is a platform that makes it easy to run inference on various open-source
machine learning models without worrying about setting up and managing your
own infrastructure.
Replicate offers deployment service for both public and private models. You only
pay for what you use on Replicate, billed by the second. When you don’t run
anything, it scales to zero and you don’t pay a thing. For details on the pricing for
public and private models, check out Replicate pricing page.
LlamaIndex offers the Replicate class to integrate with the Replicate platform. You
need to create an API key first. Then install replicate by running the following
command:
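The install command:

```shell
pip install replicate
```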
You then define your llm by constructing the Replicate class, passing in the model
name in the following format. If you don’t know your model name, search on
Replicate Explore page to find your model name.
import os
from llama_index.llms import Replicate
os.environ["REPLICATE_API_TOKEN"] = "######################"
llm = Replicate(
    # version hash truncated in the source; look up the full model string on Replicate's Explore page
    model="tomasmcm/zephyr-7b-beta:961cd6665b811d0c43c0b9488b6dfa85ff5c7bfb875e",
)
Method 9: GradientBaseModelLLM
Gradient is the only AI platform that allows you to combine Industry Expert AIs with
your private data. Industry Expert AIs are built on state-of-the-art open-source
LLMs. It’s designed to democratize AI. Models are run remotely on Gradient’s
platform.
LlamaIndex integrates with Gradient through its GradientBaseModelLLM class. See the code snippet below to construct the llm ; you will need to pass in the Gradient access token and workspace ID.
os.environ["GRADIENT_ACCESS_TOKEN"] = "####################"
os.environ["GRADIENT_WORKSPACE_ID"] = "#############################"
llm = GradientBaseModelLLM(
base_model_slug="llama2-7b-chat",
max_tokens=400,
)
Clarifai: offers both free and paid access to its LLM models, depending on your
usage needs and budget.
Llama API: a hosted API for Llama 2 with function calling support.
Self-hosting
Pros:
Control: You have control over the model, its configuration, and its data. This
can be crucial for sensitive data or situations where customization is critical.
Privacy: You keep your data on your own infrastructure, reducing the risk of it
being shared or used for unauthorized purposes.
Customization: You can modify and fine-tune the model to your specific needs
and use cases.
Hybrid Approach: It’s totally acceptable to combine local and cloud hosting and
execution of the models. For example, you could run smaller models locally for
quick tasks and POCs, and use a remote server for more demanding tasks such
as production usage.
Cons:
Security: You are responsible for securing your infrastructure and data, which
can be complex and resource-intensive.
Updates: Staying up-to-date with the latest model versions and security patches
can be time-consuming.
Third-party hosting
Pros:
Convenience: It’s much easier to get started with vendor hosting as they handle
the infrastructure and technical aspects.
Updates: The third party automatically updates the model with the latest versions and security patches.
Cons:
Control: You have less control over the model and its configuration.
Data privacy: Your data may be stored on the vendor’s infrastructure, raising
privacy concerns.
The complexity of the model: More complex models require more technical
expertise for self-hosting.
Your technical resources: Do you have the staff and expertise to manage self-
hosting?
Your data privacy requirements: How sensitive is your data, and how important
is it to keep it in-house?
Your budget: Can you afford the upfront costs and ongoing fees of vendor
hosting?
Overall, the best option for you will depend on your specific needs and resources.
Summary
In this article, we explored the many ways LlamaIndex integrates with open-source
models. The diagrams below summarize the key points of the 10+ ways to run open-
source models with LlamaIndex.
Diagram by author
The source code for this article can be found in my Colab notebook.
I hope you find this article helpful. I welcome any comments or corrections.
Happy coding!
References:
A developer’s guide to open source LLMs and generative AI
Project vLLM
Ollama - Llama 2 7B
LlamaCPP
liteLLM Documentation
Mom, wife, software architect with a passion for technology and crafting quality products
linkedin.com/in/wenqi-glantz-b5448a5a/ twitter.com/wenqi_glantz