# llama.swift
A fork of @ggerganov's llama.cpp to use Facebook's LLaMA in Swift.
## Description
See the main repository for info about the C++ implementation.
## Setup

Here are the steps for the LLaMA-7B model:
```bash
# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
./quantize.sh 7B

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
```
When running the larger models, make sure you have enough disk space to store all the intermediate files.
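If loading the model from Swift later fails, a quick sanity check that the quantized file actually exists at the expected path can save time. A small Foundation snippet; the path is simply the one produced by the steps above:

```swift
import Foundation

// Path produced by the convert + quantize steps above (adjust if yours differs).
let modelPath = "models/7B/ggml-model-q4_0.bin"

if FileManager.default.fileExists(atPath: modelPath) {
    print("Model found at \(modelPath)")
} else {
    print("Model missing — re-run the convert/quantize steps above")
}
```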
## Building

For now, compile from source; other distribution channels will be added shortly.

NB: Make sure to build `llama.framework` for Release if you want it to be snappy; Debug builds are significantly slower.
## Usage

In Swift:
```swift
let url = ... // URL to the model file, as per llama.cpp
let llama = LlamaRunner(modelURL: url)

llama.run(
  with: "Building a website can be done in 10 simple steps:",
  config: LlamaRunner.Config(numThreads: 8, numTokens: 512), // Can also specify `reversePrompt`
  tokenHandler: { token in
    // If printing tokens directly, use `terminator: ""` as the tokens include whitespace and newlines.
    print(token, terminator: "")
  },
  stateChangeHandler: { state in
    switch state {
    case .notStarted:
      // ...
      break
    case .initializing:
      // ...
      break
    case .generatingOutput:
      // ...
      break
    case .completed:
      // ...
      break
    case .failed(error: let error):
      // ...
      break
    }
  })
```
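As noted in the comment above, `Config` can also take a `reversePrompt`, which is useful for chat-style prompts. A minimal sketch; the exact parameter position and the pause-on-match behavior are assumptions here, not confirmed API:

```swift
// Sketch only: `reversePrompt` is mentioned above, but its exact
// signature and semantics here are assumptions.
llama.run(
  with: "Transcript of a dialog between User and Bot.\nUser: Hello!\nBot:",
  config: LlamaRunner.Config(numThreads: 8, numTokens: 256, reversePrompt: "User:"),
  tokenHandler: { print($0, terminator: "") },
  stateChangeHandler: { _ in })
```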
Using the `llamaTest` app:

- Set `MODEL_PATH` in `LlamaTest.xcconfig` to point to your `path/to/ggml-model-q4_0.bin`, then build & run for interactive prompt generation.
- Make sure to build for Release if you want this to be snappy.
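If you'd rather collect the full generated text instead of streaming it, you can accumulate tokens in the handler. A minimal sketch using only the API shown above, reusing the `llama` instance from the first example:

```swift
var output = ""

llama.run(
  with: "Building a website can be done in 10 simple steps:",
  config: LlamaRunner.Config(numThreads: 8, numTokens: 128),
  tokenHandler: { token in
    output += token // tokens already include whitespace and newlines
  },
  stateChangeHandler: { state in
    if case .completed = state {
      print(output) // full generation is available once the run completes
    }
  })
```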
## Misc
- License: MIT
- Other matters: See the llama.cpp repo.