Sample repository
Development Status :: 2 - Pre-Alpha
Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: parkminwoo1991@gmail.com.
danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq model card
- 4-bit GPTQ-quantized weights (group size 128) of danielpark/ko-llama-2-jindo-7b-instruct
- GPTQ is a state-of-the-art one-shot weight quantization method. The quantization code is built upon GPTQ, GPTQ-for-LLaMa, GPTQ-triton, and AutoGPTQ.
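To make "4-bit, group size 128" concrete, the sketch below quantizes one group of 128 weights with plain round-to-nearest, using one scale and one offset per group. It only illustrates the storage idea and is not GPTQ itself: GPTQ picks the integer codes with an error-compensating procedure rather than simple rounding. The function names are hypothetical and not part of AutoGPTQ, and the offset is kept as a plain float for simplicity.

```python
import random

GROUP_SIZE = 128  # number of weights that share one scale and offset
BITS = 4          # each weight is stored as an integer code in [0, 15]


def quantize_group(weights, bits=BITS):
    """Round-to-nearest quantization of one group; returns (codes, scale, offset)."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # A constant group has zero range; fall back to a scale of 1.0.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [offset + c * scale for c in codes]


# Illustrative data: random values standing in for one slice of a weight matrix.
random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
codes, scale, offset = quantize_group(group)
restored = dequantize_group(codes, scale, offset)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"codes in [{min(codes)}, {max(codes)}], max error {max_error:.6f}")
```

With round-to-nearest, the reconstruction error of each weight is at most half the group's scale, which is why a smaller group size (a tighter value range per scale) generally trades extra metadata for lower error.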
Prompt Template
### System:
{System}
### User:
{User}
### Assistant:
{Assistant}
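A minimal sketch of filling this template in Python is shown here. The helper name `build_prompt` and the default system message are hypothetical, and the exact newline placement is an assumption based on the layout above. The Assistant section is left empty so the model writes the reply.

```python
PROMPT_TEMPLATE = (
    "### System:\n{system}\n"
    "### User:\n{user}\n"
    "### Assistant:\n"
)


def build_prompt(user, system="You are a helpful assistant."):
    """Fill the template, leaving the Assistant section open for the model."""
    return PROMPT_TEMPLATE.format(system=system, user=user)


print(build_prompt("What is GPTQ?"))
```

The returned string can be passed as the prompt to whichever generation call is used.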
Inference
Install AutoGPTQ to run inference.
$ pip install auto-gptq
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM
# Set config
MODEL_NAME_OR_PATH = "danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq"
MODEL_BASENAME = "gptq_model-4bit-128g"
USE_TRITON = False
# Load the quantized model and its tokenizer
MODEL = AutoGPTQForCausalLM.from_quantized(
    MODEL_NAME_OR_PATH,
    model_basename=MODEL_BASENAME,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=USE_TRITON,
    quantize_config=None,
)
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, use_fast=True)
def generate_text_with_model(prompt):
    prompt_template = f"{prompt}\n"
    input_ids = TOKENIZER(prompt_template, return_tensors='pt').input_ids.cuda()
    # do_sample=True is needed for temperature to take effect
    output = MODEL.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
    generated_text = TOKENIZER.decode(output[0])
    return generated_text
def generate_text_with_pipeline(prompt):
    # Silence transformers warnings while the pipeline runs
    logging.set_verbosity(logging.CRITICAL)
    pipe = pipeline(
        "text-generation",
        model=MODEL,
        tokenizer=TOKENIZER,
        max_new_tokens=512,
        do_sample=True,  # needed for temperature and top_p to take effect
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )
    prompt_template = f"{prompt}\n"
    generated_text = pipe(prompt_template)[0]['generated_text']
    return generated_text
# Example
prompt_text = "What is GPTQ?"
generated_text_model = generate_text_with_model(prompt_text)
print(generated_text_model)
generated_text_pipeline = generate_text_with_pipeline(prompt_text)
print(generated_text_pipeline)
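Both functions above return the prompt together with the completion. When the prompt follows the Prompt Template section, the reply can be separated with a small helper like the sketch below. The name `extract_assistant_reply` is hypothetical, and the `<s>` / `</s>` strings are assumed to be the tokenizer's special tokens, which `decode` leaves in place unless `skip_special_tokens=True` is passed.

```python
ASSISTANT_MARKER = "### Assistant:"


def extract_assistant_reply(generated_text, marker=ASSISTANT_MARKER):
    """Return the text after the first assistant marker, or all text if absent."""
    before, found, after = generated_text.partition(marker)
    text = after if found else before
    # Drop special tokens that decode() may leave in the string.
    for token in ("<s>", "</s>"):
        text = text.replace(token, "")
    return text.strip()


# Illustrative string only, not real model output.
sample = (
    "<s> ### System:\nYou are a helpful assistant.\n"
    "### User:\nWhat is GPTQ?\n"
    "### Assistant:\nGPTQ is a quantization method.</s>"
)
print(extract_assistant_reply(sample))  # prints "GPTQ is a quantization method."
```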
Web Demo
The web demos are built with several popular tools for rapidly creating web UIs.
| Model | Web UI | Quantization |
|---|---|---|
| danielpark/ko-llama-2-jindo-7b-instruct | using gradio on colab | - |
| danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq | using text-generation-webui on colab | gptq |
| danielpark/ko-llama-2-jindo-7b-instruct-ggml | koboldcpp-v1.38 | ggml |