# llamacpphtmld
A web interface and API for the LLaMA large language AI model, based on the llama.cpp runtime.
## Features
- Live streaming responses
- Continuation-based UI
- Supports interrupt, modify, and resume
- Configure the maximum number of simultaneous users
- Works with any LLaMA model including Vicuna
- Bundled copy of llama.cpp, no separate compilation required
## Usage
All configuration should be supplied as environment variables:
```sh
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
LCH_NET_BIND=:8090 \
LCH_SIMULTANEOUS_REQUESTS=1 \
./llamacpphtmld
```
Use the `GOMAXPROCS` environment variable to control how many threads the llama.cpp engine uses.
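
For example, a sketch of an invocation that limits the engine to four threads (the thread count here is only illustrative):

```sh
# GOMAXPROCS is the standard Go runtime setting; 4 is an illustrative value.
GOMAXPROCS=4 \
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
LCH_NET_BIND=:8090 \
LCH_SIMULTANEOUS_REQUESTS=1 \
./llamacpphtmld
```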
## API usage
The `generate` endpoint will live-stream new tokens into an existing conversation until the LLM stops naturally.
- Usage:

  ```sh
  curl -v -X POST -d '{"Content": "The quick brown fox"}' 'http://localhost:8090/api/v1/generate'
  ```
- You can optionally supply `ConversationID` and `APIKey` string parameters. However, these are not currently used by the server.
- You can optionally supply a `MaxTokens` integer parameter to cap the number of tokens the LLM generates (see the example below).
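
A sketch of a request combining these optional fields with `Content`; the field names come from the parameters listed above, while the `ConversationID`, `APIKey`, and `MaxTokens` values are placeholders:

```sh
# Field names are taken from the documented parameters; all values below are
# placeholders (ConversationID and APIKey are currently ignored by the server).
curl -v -X POST \
  -d '{"Content": "The quick brown fox", "MaxTokens": 64, "ConversationID": "demo-1", "APIKey": "example-key"}' \
  'http://localhost:8090/api/v1/generate'
```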
## License
MIT
## Changelog
### 2023-04-09 v1.1.0
- New web interface style that is more mobile-friendly and shows API status messages
- Add default example prompt
- Use a longer `n_ctx` by default
### 2023-04-08 v1.0.0
- Initial release