# xllamacpp
**Repository Path**: xorbitsai/xllamacpp
## Basic Information
- **Project Name**: xllamacpp
- **Description**: xllamacpp - a Python wrapper of llama.cpp
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: bld/arm64_cuda_build
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-07-01
- **Last Updated**: 2025-08-07
## Categories & Tags
**Categories**: ai
**Tags**: None
## README

# xllamacpp - a Python wrapper of llama.cpp
[PyPI](https://pypi.org/project/xllamacpp/)
[License](https://github.com/xorbitsai/inference/blob/main/LICENSE)
[Discord](https://discord.gg/Xw9tszSkr5)
[Twitter](https://twitter.com/xorbitsio)
This project is a fork of [cyllama](https://github.com/shakfu/cyllama) and provides a Python wrapper for @ggerganov's [llama.cpp](https://github.com/ggerganov/llama.cpp), arguably the most active open-source compiled LLM inference engine.
## Comparison with llama-cpp-python
The following table provides an overview of the current implementations and features:
| Implementations / features | xllamacpp | llama-cpp-python |
| :------------------------- | :-----------------: | :-----------------------------------: |
| Wrapper type               | Cython              | ctypes                                 |
| API                        | Server & Params API | Llama API                              |
| Server implementation      | C++                 | Python, through the wrapped Llama API  |
| Continuous batching        | yes                 | no                                     |
| Thread safe                | yes                 | no                                     |
It goes without saying that any help, collaboration, or contributions to accelerate the above are welcome!
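The "Server & Params API" row above refers to a parameter object that mirrors llama.cpp's `common_params` plus a server class wrapping llama.cpp's C++ server. The snippet below is only a minimal sketch of how such a surface is typically used; the exact names (`CommonParams`, `Server`, and the attribute layout) are assumptions here, so treat the `tests` directory as the authoritative reference.
```python
# Minimal sketch of the Server & Params API.
# NOTE: class names and attribute layout are assumptions, not the
# definitive bindings -- see the tests directory for the real API.
from xllamacpp import CommonParams, Server  # assumed exports

params = CommonParams()                       # mirrors llama.cpp's common_params struct
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"  # assumed field layout
params.n_ctx = 512                            # context size, same name as in llama.cpp
params.n_parallel = 2                         # slots available for continuous batching

server = Server(params)                       # C++ server; safe to call from multiple threads
```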
## Wrapping Guidelines
As the intent is to provide a very thin wrapping layer that plays to the strengths of both the original C++ library and Python, the wrapping approach intentionally adopts the following guidelines:
- In general, key structs are implemented as Cython extension classes, with related functions implemented as methods of those classes.
- Be as consistent as possible with llama.cpp's naming of its API elements, except where it makes sense to shorten function names that are used as methods.
- Minimize non-wrapper Python code.
## Install
- From PyPI for CPU or macOS:
```sh
pip install -U xllamacpp
```
- From the GitHub-hosted package index for CUDA (use `--force-reinstall` to replace an installed CPU version):
```sh
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
```
- From the GitHub-hosted package index for HIP (AMD GPUs; use `--force-reinstall` to replace an installed CPU version):
```sh
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/rocm-6.2.4
```
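After installing, a quick import check confirms that the package loads; the `__version__` attribute is an assumption here and may not be exposed by every release, so the snippet falls back gracefully.
```python
# Quick post-install sanity check; __version__ is an assumed attribute.
import xllamacpp

print(getattr(xllamacpp, "__version__", "unknown"))
```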
## Setup
To build `xllamacpp`:
1. Make sure a recent version of `python3` is installed (tested on Python 3.12).
2. Clone the latest version of `xllamacpp` and initialize its submodules:
```sh
git clone git@github.com:xorbitsai/xllamacpp.git
cd xllamacpp
git submodule init
git submodule update
```
3. Install the build and test dependencies (`cython`, `setuptools`, and `pytest`):
```sh
pip install -r requirements.txt
```
4. Run `make` in the terminal to build the project.
## Testing
The `tests` directory in this repo provides extensive examples of using xllamacpp.
As a first step, however, you should download a smallish LLM in the `.gguf` format from [Hugging Face](https://huggingface.co/models?search=gguf). A good model to start with, and the one assumed by the tests, is [Llama-3.2-1B-Instruct-Q8_0.gguf](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf). `xllamacpp` expects models to be stored in a `models` folder inside the cloned `xllamacpp` directory. To create the `models` directory (if it doesn't exist) and download this model, just type:
```sh
make download
```
This is roughly equivalent to:
```sh
cd xllamacpp
mkdir models && cd models
wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
```
Now you can test it using `llama-cli` or `llama-simple`:
```sh
bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
-p "Is mathematics discovered or invented?"
```
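The same model can also be exercised through the Python wrapper. The snippet below is a sketch only: the class names, the attribute layout, and the callback-based handler signature are assumptions and may not match the actual bindings; the `tests` directory shows the real calls.
```python
# Sketch of driving the downloaded model from Python.
# NOTE: class names, attribute layout, and the handler signature are
# assumptions -- check the tests directory for the actual API.
import json
from xllamacpp import CommonParams, Server  # assumed exports

params = CommonParams()
params.model.path = "models/Llama-3.2-1B-Instruct-Q8_0.gguf"  # assumed field layout
params.n_ctx = 512

server = Server(params)

# OpenAI-style chat request, as accepted by the llama.cpp server's
# /v1/chat/completions endpoint.
request = json.dumps({
    "messages": [
        {"role": "user", "content": "Is mathematics discovered or invented?"}
    ],
    "max_tokens": 32,
})

# Assumed signature: (request_json, ok_callback, error_callback).
server.handle_chat_completions(
    request,
    lambda result: print(result),
    lambda error: print("error:", error),
)
```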
You can also run the test suite with `pytest`, either by typing `pytest` directly or by running:
```sh
make test
```