Access state-of-the-art Large Language Models & APIs
It is common to evaluate existing LLMs on certain benchmarks. This article is a summary of some of the existing state-of-the-art language models.
1. Overview
| Name | Official | API | Device | Proxy |
| --- | --- | --- | --- | --- |
| OpenAI (GPT) | Y | Y | both | Y |
| Bard | N | Y | both | Y |
| Claude | N | Y | both | N |
2. APIs
2.1. GPT-3.5-turbo (ChatGPT) & GPT-4
- OS: Ubuntu 18.04; also works on Linux Mint, Windows, and other environments
- Device: both
- link: https://github.com/openai/openai-python
- proxy: yes
Usage, to evaluate politeness:
```python
# pip install openai
import os

import openai


def inference_one_sample(text):
    openai.api_key = os.getenv("OPENAI_KEY")
    past_msg_ls = []
    past_msg_ls.append({"role": "user", "content": text})
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=past_msg_ls,
    )
    resp = res["choices"][0]["message"]["content"]
    return resp


def main():
    res = inference_one_sample("fuck your mother. your son of bitch.")
    print(res)
    return 0


if __name__ == "__main__":
    main()
```
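The snippet above has no error handling; if you call the API at scale you may hit transient failures such as rate limits. A minimal retry sketch (reusing `inference_one_sample` defined above; the retry count and wait time are arbitrary choices) could look like:

```python
import time


def inference_with_retry(text, max_retries=3, wait_seconds=20):
    """Retry inference_one_sample a few times, backing off on API errors."""
    for attempt in range(max_retries):
        try:
            return inference_one_sample(text)
        except Exception as e:  # e.g. rate-limit or network errors
            print(f"attempt {attempt + 1} failed: {e}")
            time.sleep(wait_seconds)
    raise RuntimeError("all retries failed")
```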
2.2. Bard
- OS: Ubuntu 18.04; also works on Linux Mint, Windows, and other environments
- Device: both
- link: https://github.com/dsdanielpark/Bard-API
- proxy: yes
- is official: NO
- original app link: https://bard.google.com/
Usage:
```python
# pip install bardapi
import os

import requests
from bardapi import Bard


def inference_one_sample(text):
    token_bard = os.getenv("BARD_KEY")
    print(token_bard)  # debug: confirm the token was loaded
    proxies = {
        'http': 'http://username:pswd@ip:port',
        'https': 'http://uname:pswd@ip:port',
    }
    session = requests.Session()
    session.headers = {
        "Host": "bard.google.com",
        "X-Same-Domain": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        "Origin": "http://bard.google.com",
        "Referer": "http://bard.google.com/",
    }
    session.cookies.set("__Secure-1PSID", os.getenv("BARD_KEY"))
    bard = Bard(token=token_bard, session=session, proxies=proxies, timeout=30)
    res = bard.get_answer(text)['content']
    print(res)
    return res


def main():
    inference_one_sample("1+1=?")


if __name__ == "__main__":
    main()
```
How to obtain the `__Secure-1PSID` value?
- Sign up for or log into https://bard.google.com/ in a web browser.
- Open the browser developer tools (e.g. with Ctrl-Shift-I).
- Find this value in the cookies; it may be stored among your site variables.
Use `time.sleep(60 * 15)` to limit the access frequency to once every 15 minutes.
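For example, a minimal loop over several prompts that respects this limit (a sketch reusing `inference_one_sample` from the snippet above; the prompts are placeholders):

```python
import time

prompts = ["1+1=?", "2+2=?", "3+3=?"]  # placeholder queries
for prompt in prompts:
    inference_one_sample(prompt)
    time.sleep(60 * 15)  # wait 15 minutes before the next request
```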
2.3. Claude
NOTE: for Claude, I use the Slack API to send/receive messages to/from the Claude chatbot in Slack.
- OS: Ubuntu 18.04; also works on Linux Mint, Windows, and other environments
- Device: both
- link: none
- proxy: not required
- is official: NO
- original app link:
- documentation of how to use the Slack API
2.3.1. Step 1: Set up Slack to obtain a Slack access token and the App ID of your Claude chatbot
Look up this article: blablabla
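After finishing Step 1, export the Slack user token and the Claude bot ID as environment variables; a quick sanity check (a sketch; the variable names match the snippet in Step 2) is:

```python
import os

# Confirm the environment variables expected by the Step 2 client are set.
for var in ("SLACK_USER_TOKEN", "CLAUDE_BOT_ID"):
    if not os.getenv(var):
        raise RuntimeError(f"{var} is not set")
```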
2.3.2. Step 2: Use the Claude API through Slack
Usage:
```python
import os
import time

# pip install slack-sdk
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError


class SlackClient(WebClient):
    CHANNEL_ID = None
    CLAUDE_BOT_ID = os.getenv("CLAUDE_BOT_ID")

    def chat(self, text):
        if not self.CHANNEL_ID:
            raise Exception("Channel not found.")
        resp = self.chat_postMessage(channel=self.CHANNEL_ID, text=text)
        self.LAST_TS = resp["ts"]

    def get_reply(self):
        for _ in range(150):
            try:
                resp = self.conversations_history(
                    channel=self.CHANNEL_ID, oldest=self.LAST_TS, limit=2
                )
                msg = [
                    msg["text"]
                    for msg in resp["messages"]
                    if msg["user"] == self.CLAUDE_BOT_ID
                ]
                # Wait until the bot has finished typing before returning.
                if msg and not msg[-1].endswith("Typing…_"):
                    return msg[-1]
            except (SlackApiError, KeyError) as e:
                print(f"Get reply error: {e}")
            time.sleep(1)  # pace the polling loop
        raise Exception("Get reply timeout")

    def open_channel(self):
        response = self.conversations_open(users=self.CLAUDE_BOT_ID)
        self.CHANNEL_ID = response["channel"]["id"]


def forward(text):
    client = SlackClient(token=os.getenv("SLACK_USER_TOKEN"))
    client.open_channel()
    client.chat(text)
    reply = client.get_reply()
    print(f"Claude: {reply}\n--------------------")


## running entry
if __name__ == "__main__":
    forward("do you know Buddha?")
```
Use `time.sleep(60 * 5)` to limit the access frequency to once every 5 minutes.
3. LLMs
3.1. Llama & Llama2
You can use llama.cpp to run different versions of the LLaMA models.
- OS: Linux recommended
- Device: GPU recommended
- link: https://github.com/ggerganov/llama.cpp
- proxy: not required
- is official: NA
First, compile the repository on your device:
```bash
# clone to local
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# vanilla compile
make

# other types of compile to support GPUs:
## CUDA
# make LLAMA_CUBLAS=1
## Apple devices
# LLAMA_METAL=1 make
```
If you compile on your server (e.g. on Ubuntu), you may see errors due to an outdated compiler, in which case you should first update the gcc version with:
```bash
sudo apt install gcc-Version gcc-Version-multilib g++-Version g++-Version-multilib
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-Version 50
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-Version 50
```
Replace "Version" in the commands above with the desired gcc version (e.g. 9), and then use the new compiler.
Before running make again, you may need to execute `make clean` to clear the compile cache.
Once the make procedure is done, you will find an executable file named `main` in your directory, and you can check it with:
```bash
export model_path="your llama model path"
export llama_path="your main file path"
export query="hello, whats your name?"

$llama_path -m "$model_path" -p "$query"
```
Use `${llama_path} --help` to look up all options:
```
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -i, --interactive     run in interactive mode
  --interactive-first   run in interactive mode and wait for input right away
  -ins, --instruct      run in instruction mode (use with Alpaca models)
  --multiline-input     allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
                        halt generation at PROMPT, return control in interactive mode
                        (can be specified more than once for multiple prompts).
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N     number of threads to use during computation (default: 40)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: empty)
  -e                    process prompt escapes sequences (\n, \r, \t, \', \", \\)
  --prompt-cache FNAME  file to cache prompt state for faster startup (default: none)
  --prompt-cache-all    if specified, saves user input and generations to cache as well.
                        not supported with --interactive or other interactive options
  --prompt-cache-ro     if specified, uses the prompt cache but does not update it.
  --random-prompt       start with a randomized prompt.
  --in-prefix-bos       prefix BOS to user inputs, preceding the `--in-prefix` string
  --in-prefix STRING    string to prefix user inputs with (default: empty)
  --in-suffix STRING    string to suffix after user inputs with (default: empty)
  -f FNAME, --file FNAME
                        prompt file to start generation.
  -n N, --n-predict N   number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
  -c N, --ctx-size N    size of the prompt context (default: 512)
  -b N, --batch-size N  batch size for prompt processing (default: 512)
  -gqa N, --gqa N       grouped-query attention factor (TEMP!!! use 8 for LLaMAv2 70B) (default: 1)
  -eps N, --rms-norm-eps N
                        rms norm eps (TEMP!!! use 1e-5 for LLaMAv2) (default: 5.0e-06)
  --top-k N             top-k sampling (default: 40, 0 = disabled)
  --top-p N             top-p sampling (default: 0.9, 1.0 = disabled)
  --tfs N               tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
  --typical N           locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
  --repeat-last-n N     last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
  --repeat-penalty N    penalize repeat sequence of tokens (default: 1.1, 1.0 = disabled)
  --presence-penalty N  repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
  --frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
  --mirostat N          use Mirostat sampling.
                        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
  --mirostat-lr N       Mirostat learning rate, parameter eta (default: 0.1)
  --mirostat-ent N      Mirostat target entropy, parameter tau (default: 5.0)
  -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
                        modifies the likelihood of token appearing in the completion,
                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
  --grammar GRAMMAR     BNF-like grammar to constrain generations (see samples in grammars/ dir)
  --grammar-file FNAME  file to read grammar from
  --cfg-negative-prompt PROMPT
                        negative prompt to use for guidance. (default: empty)
  --cfg-scale N         strength of guidance (default: 1.000000, 1.0 = disable)
  --rope-scale N        RoPE context linear scaling factor, inverse of --rope-freq-scale (default: 1)
  --rope-freq-base N    RoPE base frequency, used by NTK-aware scaling (default: 10000.0)
  --rope-freq-scale N   RoPE frequency linear scaling factor, inverse of --rope-scale (default: 1)
  --ignore-eos          ignore end of stream token and continue generating (implies --logit-bias 2-inf)
  --no-penalize-nl      do not penalize newline token
  --memory-f32          use f32 instead of f16 for memory key+value (default: disabled)
                        not recommended: doubles context memory required and no measurable increase in quality
  --temp N              temperature (default: 0.8)
  --perplexity          compute perplexity over each ctx window of the prompt
  --hellaswag           compute HellaSwag score over random tasks from datafile supplied with -f
  --hellaswag-tasks N   number of tasks to use when computing the HellaSwag score (default: 400)
  --keep N              number of tokens to keep from the initial prompt (default: 0, -1 = all)
  --chunks N            max number of chunks to process (default: -1, -1 = all)
  --mlock               force system to keep model in RAM rather than swapping or compressing
  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)
  --numa                attempt optimizations that help on some NUMA systems
                        if run without this previously, it is recommended to drop the system page cache before using this
                        see https://github.com/ggerganov/llama.cpp/issues/1437
  --mtest               compute maximum memory usage
  --export              export the computation graph to 'llama.ggml'
  --verbose-prompt      print prompt before generation
  --simple-io           use basic IO for better compatibility in subprocesses and limited consoles
  --lora FNAME          apply LoRA adapter (implies --no-mmap)
  --lora-base FNAME     optional model to use as a base for the layers modified by the LoRA adapter
  -m FNAME, --model FNAME
                        model path (default: models/7B/ggml-model.bin)
```
If you have no LLaMA checkpoints, you can request access from the official LLaMA page, or download a quantized version released by a third party; what I used is the Llama-2-7B chat model quantized to int8 (`llama-2-7b-chat.ggmlv3.q8_0.bin`).
Beyond the interactive mode of llama.cpp, you might want to use it as an API. This was not directly supported at the time of writing, so I wrote a simple synchronous wrapper for my research and experiments; the Python code is shown below:
```python
import subprocess


def inference_llama2(sent):
    llama = "/home/liangzi/llama.cpp/main"
    model = "/home/liangzi/models/llama2/llama-2-7b-chat.ggmlv3.q8_0.bin"
    scripts = f"""
    {llama} -m {model} -p "Please response to the question of user. User: {sent} Answer: "
    """
    result = subprocess.run(
        scripts, shell=True, capture_output=True
    ).stdout.decode("utf8")
    # Keep only the text between "Answer: " and the next "User" turn.
    if "Answer: " in result:
        result = result.split("Answer: ")[1]
    if "User" in result:
        result = result.split("User")[0]
    print(result)
    return result
```
This is very simple; you can write your own version to support more complex functionality.
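For instance, a variant of the wrapper that passes the arguments as a list instead of building a shell string (a sketch with the same hypothetical paths as above) avoids quoting problems when the user text contains quotes or shell metacharacters:

```python
import subprocess


def inference_llama2_argv(sent):
    # Same hypothetical paths as above; adjust them to your setup.
    llama = "/home/liangzi/llama.cpp/main"
    model = "/home/liangzi/models/llama2/llama-2-7b-chat.ggmlv3.q8_0.bin"
    prompt = f"Please respond to the question of the user. User: {sent} Answer: "
    result = subprocess.run(
        [llama, "-m", model, "-p", prompt],
        capture_output=True,
    ).stdout.decode("utf8")
    # Strip everything before the answer and after the next user turn.
    if "Answer: " in result:
        result = result.split("Answer: ")[1]
    if "User" in result:
        result = result.split("User")[0]
    return result
```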
3.2. Using candle to run Llama, Llama 2 & Falcon
candle is an ML framework in Rust, by Hugging Face.
- OS: Linux recommended
- Device: GPU recommended
- link: https://github.com/huggingface/candle
- proxy: not required
- is official: NA
Compared to llama.cpp, using candle is easier.
- You need a recent stable Rust environment on your system to compile candle.
- Clone the repository with: `git clone https://github.com/huggingface/candle`
- Run one of the example commands:
```bash
cargo run --example whisper --release
cargo run --example llama --release
cargo run --example falcon --release
cargo run --example bert --release
cargo run --example bigcode --release
cargo run --example stable-diffusion --release --features image -- --prompt "a rusty robot holding a fire torch"
```
When I used candle, I ran into this problem:
```
Error: request error: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer.json: status code 401

Caused by:
    https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer.json: status code 401
```
which is caused by the access permissions of the pre-trained models.
Taking Llama 2 as an example, you should first request access to this model on the Hugging Face website.
You cannot use this model until you receive an access-granted notice like mine.
Then install `huggingface_hub`; it may already be installed if your Python environment includes transformers or other Hugging Face packages.
Use `huggingface-cli login` and enter your access token. If you have no token, just look it up here.
The above solution comes from this issue.
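Alternatively, you can authenticate from Python with the `huggingface_hub` API instead of the CLI; the sketch below (the token value is a placeholder) logs in and checks that the gated tokenizer file from the error message above can be downloaded:

```python
from huggingface_hub import login, hf_hub_download

login(token="hf_...")  # placeholder: your Hugging Face access token
# Verify access to the gated Llama-2 tokenizer referenced in the 401 error.
path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    filename="tokenizer.json",
)
print(path)
```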