# llamacpphtmld

A web interface and API for the LLaMA large language AI model, based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) runtime.

## Features

- Live streaming responses
- Continuation-based UI
- Supports interrupt, modify, and resume
- Configure the maximum number of simultaneous users
- Works with any LLaMA model including [Vicuna](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit)
- Bundled copy of llama.cpp, no separate compilation required

## Usage

All configuration should be supplied as environment variables:

```
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
	LCH_NET_BIND=:8090 \
	LCH_SIMULTANEOUS_REQUESTS=1 \
	./llamacpphtmld
```

Use the `GOMAXPROCS` environment variable to control how many threads the llama.cpp engine uses.

## API usage

The `generate` endpoint will live stream new tokens into an existing conversation until the LLM stops naturally.

- Usage: `curl -v -X POST -d '{"Content": "The quick brown fox"}' 'http://localhost:8090/api/v1/generate'`
- You can optionally supply `ConversationID` and `APIKey` string parameters. However, these are not currently used by the server.
- You can optionally supply a `MaxTokens` integer parameter, to cap the number of generated tokens from the LLM.

## License

MIT

## Changelog

### 2023-04-08 v1.0.0

- Initial release
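
## Example client

The `curl` call shown under API usage can also be issued from a script. The following is a minimal sketch, not part of the project: the helper names `build_generate_request` and `stream_generate` are hypothetical, and it assumes the server streams the generated text as a plain UTF-8 response body, which the description above implies but does not spell out. Only the `Content` and `MaxTokens` parameters documented above are used.

```python
import codecs
import json
import urllib.request


def build_generate_request(base_url, content, max_tokens=None):
    """Build a POST request for /api/v1/generate (hypothetical helper)."""
    payload = {"Content": content}
    if max_tokens is not None:
        # MaxTokens is the optional integer cap described under API usage.
        payload["MaxTokens"] = max_tokens
    # No Content-Type header is set explicitly, so urllib sends its default
    # form-encoded type, the same thing `curl -d` does in the example above.
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
    )


def stream_generate(base_url, content, max_tokens=None, chunk_size=256):
    """Yield pieces of text as they arrive.

    Assumption: the response body is raw UTF-8 text streamed incrementally.
    """
    request = build_generate_request(base_url, content, max_tokens)
    # An incremental decoder avoids breaking multi-byte characters that
    # happen to be split across two network reads.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with urllib.request.urlopen(request) as response:
        while True:
            # read1 returns as soon as some data is available.
            chunk = response.read1(chunk_size)
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Usage, with a server listening on the LCH_NET_BIND address from above:
#
#   for piece in stream_generate("http://localhost:8090", "The quick brown fox", 64):
#       print(piece, end="", flush=True)
```

Building the request in a separate function keeps the payload easy to inspect without a running server; the generator then only deals with reading the stream.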