# llamacpphtmld

A web interface and API for the LLaMA large language model, based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) runtime.

## Features

- Live streaming responses
- Continuation-based UI
- Supports interrupting, modifying, and resuming generation
- Configurable limit on the number of simultaneous users
- Works with any LLaMA model including [Vicuna](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit)
- Bundled copy of llama.cpp, no separate compilation required

## Usage

All configuration should be supplied as environment variables:

```
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
LCH_NET_BIND=:8090 \
LCH_SIMULTANEOUS_REQUESTS=1 \
./llamacpphtmld
```

Use the `GOMAXPROCS` environment variable to control how many threads the llama.cpp engine uses.

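For example, the following invocation combines `GOMAXPROCS` with the configuration shown above; the thread count of 8 is illustrative, so pick a value that matches your CPU:

```
# GOMAXPROCS=8 is illustrative; adjust to your CPU core count
GOMAXPROCS=8 \
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
LCH_NET_BIND=:8090 \
LCH_SIMULTANEOUS_REQUESTS=1 \
./llamacpphtmld
```
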
## API usage

The `generate` endpoint live-streams new tokens into an existing conversation until the LLM stops naturally.

- Usage: `curl -v -X POST -d '{"Content": "The quick brown fox"}' 'http://localhost:8090/api/v1/generate'`
- You can optionally supply `ConversationID` and `APIKey` string parameters. However, these are not currently used by the server.
- You can optionally supply a `MaxTokens` integer parameter to cap the number of tokens generated by the LLM (see the example below).

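A sketch of a request that combines the parameters above. The `ConversationID` and `APIKey` values are placeholders, the `MaxTokens` value is illustrative, and curl's `-N`/`--no-buffer` flag simply makes the streamed tokens visible as they arrive:

```
# placeholder ConversationID/APIKey values; MaxTokens caps the response length
curl -N -X POST \
  -d '{"Content": "The quick brown fox", "MaxTokens": 64, "ConversationID": "example", "APIKey": "unused"}' \
  'http://localhost:8090/api/v1/generate'
```
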
## License

MIT

## Changelog

### 2023-04-08 v1.0.0

- Initial release