# llamacpphtmld

A web interface and API for the LLaMA large language model, based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) runtime.

## Features

- Live streaming responses
- Continuation-based UI
- Supports interrupting, modifying, and resuming generation
- Configurable limit on the number of simultaneous users
- Works with any LLaMA model including [Vicuna](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit)
- Bundled copy of llama.cpp, no separate compilation required

## Usage

All configuration should be supplied as environment variables:

```
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
LCH_NET_BIND=:8090 \
LCH_SIMULTANEOUS_REQUESTS=1 \
./llamacpphtmld
```

Use the `GOMAXPROCS` environment variable to control how many threads the llama.cpp engine uses.

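For example, the following invocation combines `GOMAXPROCS` with the configuration shown above; the thread count of 8 is illustrative, so pick a value that matches your CPU:

```
# GOMAXPROCS=8 is illustrative; adjust to your CPU core count
GOMAXPROCS=8 \
LCH_MODEL_PATH=/srv/llama/ggml-vicuna-13b-4bit-rev1.bin \
LCH_NET_BIND=:8090 \
LCH_SIMULTANEOUS_REQUESTS=1 \
./llamacpphtmld
```
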
## API usage

The `generate` endpoint live-streams new tokens into an existing conversation until the LLM stops naturally.

- Usage: `curl -v -X POST -d '{"Content": "The quick brown fox"}' 'http://localhost:8090/api/v1/generate'`
- You can optionally supply `ConversationID` and `APIKey` string parameters. However, these are not currently used by the server.
- You can optionally supply a `MaxTokens` integer parameter to cap the number of tokens generated by the LLM (see the example below).

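A sketch of a request that combines the parameters above. The `ConversationID` and `APIKey` values are placeholders, the `MaxTokens` value is illustrative, and curl's `-N`/`--no-buffer` flag simply makes the streamed tokens visible as they arrive:

```
# placeholder ConversationID/APIKey values; MaxTokens caps the response length
curl -N -X POST \
  -d '{"Content": "The quick brown fox", "MaxTokens": 64, "ConversationID": "example", "APIKey": "unused"}' \
  'http://localhost:8090/api/v1/generate'
```
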
## License

MIT

## Changelog

### 2023-04-08 v1.0.0

- Initial release