Running a Local LLM on a Low-Spec PC

These days, AI is everywhere you look.
That doesn't mean it's out of reach for those of us on a budget: apparently you can run a local LLM even on a low-spec PC, so I gave it a try.

Please read this as half a joke rather than as practical advice. It is not so much a technical article as a working memo along the lines of "I got a local LLM running on a low-spec PC."

Installation Environment

First, let's look at the CPU of this low-spec PC.
Running an LLM on a CPU like this... is that even sane?

$ lscpu
Architecture:                         x86_64
  CPU op-mode(s):                     32-bit, 64-bit
  Address sizes:                      36 bits physical, 48 bits virtual
  Byte Order:                         Little Endian
CPU(s):                               4
  On-line CPU(s) list:                0-3
Vendor ID:                            GenuineIntel
  Model name:                         Intel(R) Core(TM) i5 CPU       M 460  @ 2.53GHz
    CPU family:                       6
    Model:                            37
    Thread(s) per core:               2
    Core(s) per socket:               2
    Socket(s):                        1
    Stepping:                         5
    Frequency boost:                  enabled
    CPU(s) scaling MHz:               54%
    CPU max MHz:                      2534.0000
    CPU min MHz:                      1199.0000
    BogoMIPS:                         5053.79
    Flags:                            fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid dtherm ida arat vnmi flush_l1d
Virtualization features:
  Virtualization:                     VT-x
Caches (sum of all):
  L1d:                                64 KiB (2 instances)
  L1i:                                64 KiB (2 instances)
  L2:                                 512 KiB (2 instances)
  L3:                                 3 MiB (1 instance)
NUMA:
  NUMA node(s):                       1
  NUMA node0 CPU(s):                  0-3
Vulnerabilities:
  Gather data sampling:                Not affected
  Indirect target selection:           Not affected
  Itlb multihit:                       KVM: Mitigation: VMX disabled
  L1tf:                                Mitigation; PTE Inversion; VMX conditiona
                                       l cache flushes, SMT vulnerable
  Mds:                                 Vulnerable: Clear CPU buffers attempted,
                                       no microcode; SMT vulnerable
  Meltdown:                            Mitigation; PTI
  Mmio stale data:                     Unknown: No mitigations
  Reg file data sampling:              Not affected
  Retbleed:                            Not affected
  Spec rstack overflow:                Not affected
  Spec store bypass:                   Mitigation; Speculative Store Bypass disa
                                       bled via prctl
  Spectre v1:                          Mitigation; usercopy/swapgs barriers and
                                       __user pointer sanitization
  Spectre v2:                          Mitigation; Retpolines; IBPB conditional;
                                        IBRS_FW; STIBP conditional; RSB filling;
                                        PBRSB-eIBRS Not affected; BHI Not affect
                                       ed
  Srbds:                               Not affected
  Tsa:                                 Not affected
  Tsx async abort:                     Not affected
  Vmscape:                             Not affected

The CPU model name is Intel Core i5-460M, so this is a first-generation Core.
For CPU-only inference, a CPU that supports the AVX-512 instruction set is apparently recommended.
Note that with Intel, newer is not automatically better: some Core models do not support AVX-512, so it pays to check.
If AVX2 is enough, that means Intel 4th generation or later and essentially the whole AMD Ryzen lineup.
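
If you want to check from a script which of these instruction sets your CPU advertises, reading the flags line of /proc/cpuinfo is enough on Linux. A minimal sketch (the flag names sse4_2, avx, avx2 and avx512f are the standard Linux flag strings; as the lscpu output above shows, this machine only gets as far as sse4_2):

# check_simd.py: list which SIMD extensions this CPU advertises (Linux only)
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for isa in ("sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa:8}: {'yes' if isa in flags else 'no'}")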

Next, memory.

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.6Gi       2.3Gi        66Mi       4.0Gi       6.0Gi
Swap:          2.0Gi          0B       2.0Gi

Roughly 8 GB of RAM. More would be better, but this is the environment we're working with.

Setting Up a Python Virtual Environment

As usual, set up a Python virtual environment.

For reference, here is the OS:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.4 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Now, let's actually set up the environment.

$ mkdir llama
$ cd llama
$ python3 -m venv .venv
$ source .venv/bin/activate

Installing llama-cpp-python

Time to install today's main player, llama-cpp-python.
Its requirements are minimal, which makes it a perfect fit for this experiment.

$ pip install llama-cpp-python

It needs Python 3.8 or later and a C compiler (gcc or clang). These were, of course, already installed before running the command above.

Also, according to the "Installation Configuration" section of the documentation, you can install a pre-built wheel with basic CPU support instead.

In that case, the command is:

$ pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

This lets pip pull the pre-built CPU wheel from that extra index instead of compiling from source.

Getting a Model

Grab a model of your choice from the Hugging Face website.

This time I went with the Japanese version of Google's Gemma 2 LLM (the 2B model).
alfredplpl publishes a quantized version of Google's gemma-2-2b-jpn-it, so we'll borrow that.

You can download gemma-2-2b-jpn-it-Q4_K_M.gguf in a browser from here:
https://huggingface.co/alfredplpl/gemma-2-2b-jpn-it-gguf/tree/main
The size was 1.59 GB.

To download it directly from the command line:

$ wget https://huggingface.co/alfredplpl/gemma-2-2b-jpn-it-gguf/resolve/main/gemma-2-2b-jpn-it-Q4_K_M.gguf

It isn't particularly large, so the download should finish quickly.
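
If you would rather script the download, the huggingface_hub library can fetch the same file. This is an optional sketch using an extra dependency (pip install huggingface_hub) that the rest of this post does not need:

# download_model.py: fetch the quantized GGUF with huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="alfredplpl/gemma-2-2b-jpn-it-gguf",
    filename="gemma-2-2b-jpn-it-Q4_K_M.gguf",
    local_dir=".",
)
print(f"saved to {path}")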

Checking That It Works

Now, let's check that it actually works.

$ python3 -m llama_cpp.server --model ./gemma-2-2b-jpn-it-Q4_K_M.gguf --host 0.0.0.0 --port 8000
 :
ModuleNotFoundError: No module named 'uvicorn'

In my environment several modules were missing, so I added whatever was needed each time the command failed. (The server extra, pip install 'llama-cpp-python[server]', should pull these in at once, but I went the manual route here.)

In the end it came down to this:

$ pip install uvicorn anyio starlette fastapi sse_starlette starlette_context pydantic_settings

Once the server starts successfully, open the following in a browser:
http://127.0.0.1:8000/docs

If the interactive API docs screen shows up, everything is working.
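
Since llama_cpp.server exposes an OpenAI-compatible API, you can also talk to it over HTTP while it is running (the next section instead loads the model in-process). A minimal sketch using the requests library, which is assumed to be installed; the prompt is just an example:

# server_client.py: query the running llama_cpp.server via its OpenAI-compatible endpoint
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "こんにちは!"}],
        "max_tokens": 64,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])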

Accessing It from an Application

Next, let's write some actual Python code and call the model directly.

from llama_cpp import Llama

# Path to the downloaded model
llm = Llama(model_path="./gemma-2-2b-jpn-it-Q4_K_M.gguf")
output = llm("こんにちは、", max_tokens=32, stop=["\n"], echo=True)
print(output)

Let's run it:

$ python llama_test.py
llama_model_loader: loaded meta data with 40 key-value pairs and 288 tensors from ./gemma-2-2b-jpn-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2 2b Jpn It
llama_model_loader: - kv   3:                           general.finetune str              = jpn-it
llama_model_loader: - kv   4:                           general.basename str              = gemma-2
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 2 2b It
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-2...
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["conversational", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["ja"]
llama_model_loader: - kv  13:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv  14:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv  15:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  16:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  17:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  18:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  21:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  24:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  25:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  29:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.59 GiB (5.21 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token:     45 '<unused38>' is not marked as EOG
load: control token:     74 '<unused67>' is not marked as EOG
load: control token:     55 '<unused48>' is not marked as EOG
load: control token:     99 '<unused92>' is not marked as EOG
load: control token:    102 '<unused95>' is not marked as EOG
load: control token:     44 '<unused37>' is not marked as EOG
load: control token:     26 '<unused19>' is not marked as EOG
load: control token:     42 '<unused35>' is not marked as EOG
load: control token:     92 '<unused85>' is not marked as EOG
load: control token:     90 '<unused83>' is not marked as EOG
load: control token:    106 '<start_of_turn>' is not marked as EOG
load: control token:     88 '<unused81>' is not marked as EOG
load: control token:      5 '<2mass>' is not marked as EOG
load: control token:    104 '<unused97>' is not marked as EOG
load: control token:     68 '<unused61>' is not marked as EOG
load: control token:     94 '<unused87>' is not marked as EOG
load: control token:     59 '<unused52>' is not marked as EOG
load: control token:      2 '<bos>' is not marked as EOG
load: control token:     25 '<unused18>' is not marked as EOG
load: control token:     93 '<unused86>' is not marked as EOG
load: control token:     95 '<unused88>' is not marked as EOG
load: control token:     76 '<unused69>' is not marked as EOG
load: control token:     97 '<unused90>' is not marked as EOG
load: control token:     56 '<unused49>' is not marked as EOG
load: control token:     81 '<unused74>' is not marked as EOG
load: control token:     13 '<unused6>' is not marked as EOG
load: control token:     51 '<unused44>' is not marked as EOG
load: control token:     47 '<unused40>' is not marked as EOG
load: control token:      8 '<unused1>' is not marked as EOG
load: control token:    103 '<unused96>' is not marked as EOG
load: control token:     75 '<unused68>' is not marked as EOG
load: control token:     79 '<unused72>' is not marked as EOG
load: control token:     39 '<unused32>' is not marked as EOG
load: control token:     49 '<unused42>' is not marked as EOG
load: control token:     41 '<unused34>' is not marked as EOG
load: control token:     34 '<unused27>' is not marked as EOG
load: control token:      6 '[@BOS@]' is not marked as EOG
load: control token:     40 '<unused33>' is not marked as EOG
load: control token:     33 '<unused26>' is not marked as EOG
load: control token:     86 '<unused79>' is not marked as EOG
load: control token:     43 '<unused36>' is not marked as EOG
load: control token:     35 '<unused28>' is not marked as EOG
load: control token:     32 '<unused25>' is not marked as EOG
load: control token:     28 '<unused21>' is not marked as EOG
load: control token:     19 '<unused12>' is not marked as EOG
load: control token:     67 '<unused60>' is not marked as EOG
load: control token:      9 '<unused2>' is not marked as EOG
load: control token:     52 '<unused45>' is not marked as EOG
load: control token:     16 '<unused9>' is not marked as EOG
load: control token:     98 '<unused91>' is not marked as EOG
load: control token:     80 '<unused73>' is not marked as EOG
load: control token:     71 '<unused64>' is not marked as EOG
load: control token:     36 '<unused29>' is not marked as EOG
load: control token:      0 '<pad>' is not marked as EOG
load: control token:     11 '<unused4>' is not marked as EOG
load: control token:     70 '<unused63>' is not marked as EOG
load: control token:     77 '<unused70>' is not marked as EOG
load: control token:     64 '<unused57>' is not marked as EOG
load: control token:     50 '<unused43>' is not marked as EOG
load: control token:     20 '<unused13>' is not marked as EOG
load: control token:     73 '<unused66>' is not marked as EOG
load: control token:     23 '<unused16>' is not marked as EOG
load: control token:     38 '<unused31>' is not marked as EOG
load: control token:     21 '<unused14>' is not marked as EOG
load: control token:     15 '<unused8>' is not marked as EOG
load: control token:     37 '<unused30>' is not marked as EOG
load: control token:     14 '<unused7>' is not marked as EOG
load: control token:     30 '<unused23>' is not marked as EOG
load: control token:     62 '<unused55>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token:     18 '<unused11>' is not marked as EOG
load: control token:     22 '<unused15>' is not marked as EOG
load: control token:     66 '<unused59>' is not marked as EOG
load: control token:     65 '<unused58>' is not marked as EOG
load: control token:     10 '<unused3>' is not marked as EOG
load: control token:    105 '<unused98>' is not marked as EOG
load: control token:     87 '<unused80>' is not marked as EOG
load: control token:    100 '<unused93>' is not marked as EOG
load: control token:     63 '<unused56>' is not marked as EOG
load: control token:     31 '<unused24>' is not marked as EOG
load: control token:     58 '<unused51>' is not marked as EOG
load: control token:     84 '<unused77>' is not marked as EOG
load: control token:     61 '<unused54>' is not marked as EOG
load: control token:      1 '<eos>' is not marked as EOG
load: control token:     60 '<unused53>' is not marked as EOG
load: control token:     91 '<unused84>' is not marked as EOG
load: control token:     83 '<unused76>' is not marked as EOG
load: control token:     85 '<unused78>' is not marked as EOG
load: control token:     27 '<unused20>' is not marked as EOG
load: control token:     96 '<unused89>' is not marked as EOG
load: control token:     72 '<unused65>' is not marked as EOG
load: control token:     53 '<unused46>' is not marked as EOG
load: control token:     82 '<unused75>' is not marked as EOG
load: control token:      7 '<unused0>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: control token:    101 '<unused94>' is not marked as EOG
load: control token:     78 '<unused71>' is not marked as EOG
load: control token:     89 '<unused82>' is not marked as EOG
load: control token:     69 '<unused62>' is not marked as EOG
load: control token:     54 '<unused47>' is not marked as EOG
load: control token:     57 '<unused50>' is not marked as EOG
load: control token:     12 '<unused5>' is not marked as EOG
load: control token:     48 '<unused41>' is not marked as EOG
load: control token:     17 '<unused10>' is not marked as EOG
load: control token:     24 '<unused17>' is not marked as EOG
load: control token:     46 '<unused39>' is not marked as EOG
load: control token:     29 '<unused22>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 107 ('<end_of_turn>')
load: special tokens cache size = 249
load: token to piece cache size = 1.6014 MB
print_info: arch             = gemma2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 2304
print_info: n_layer          = 26
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 4096
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 9216
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 2B
print_info: model params     = 2.61 B
print_info: general.name     = Gemma 2 2b Jpn It
print_info: vocab type       = SPM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 107 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 227 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 107 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 1
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 1
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 1
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 1
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 1
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 1
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 1
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 1
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 1
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 1
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 1
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 1
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 1
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 288 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead
load_tensors:   CPU_Mapped model buffer size =  1623.67 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.98 MiB
create_memory: n_ctx = 512 (padded)
llama_kv_cache_unified_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 512 cells
llama_kv_cache_unified: layer   0: skipped
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: skipped
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: skipped
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: skipped
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: skipped
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: skipped
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: skipped
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: skipped
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: skipped
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: skipped
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: skipped
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: skipped
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: skipped
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified:        CPU KV buffer size =    26.00 MiB
llama_kv_cache_unified: size =   26.00 MiB (   512 cells,  13 layers,  1/1 seqs), K (f16):   13.00 MiB, V (f16):   13.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 512 cells
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: skipped
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: skipped
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: skipped
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: skipped
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: skipped
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: skipped
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: skipped
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: skipped
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: skipped
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: skipped
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: skipped
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: skipped
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: skipped
llama_kv_cache_unified:        CPU KV buffer size =    26.00 MiB
llama_kv_cache_unified: size =   26.00 MiB (   512 cells,  13 layers,  1/1 seqs), K (f16):   13.00 MiB, V (f16):   13.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 2304
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:        CPU compute buffer size =   504.50 MiB
llama_context: graph nodes  = 1128
llama_context: graph splits = 1
CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Model metadata: {'general.quantization_version': '2', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.chat_template': "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}", 'tokenizer.ggml.padding_token_id': '0', 'general.base_model.0.repo_url': 'https://huggingface.co/google/gemma-2-2b-it', 'general.license': 'gemma', 'gemma2.attn_logit_softcapping': '50.000000', 'tokenizer.ggml.add_bos_token': 'true', 'general.size_label': '2B', 'general.type': 'model', 'gemma2.embedding_length': '2304', 'gemma2.block_count': '26', 'tokenizer.ggml.pre': 'default', 'general.base_model.count': '1', 'general.base_model.0.organization': 'Google', 'general.basename': 'gemma-2', 'gemma2.context_length': '8192', 'general.architecture': 'gemma2', 'gemma2.feed_forward_length': '9216', 'gemma2.attention.head_count': '8', 'tokenizer.ggml.add_eos_token': 'false', 'gemma2.attention.head_count_kv': '4', 'general.base_model.0.name': 'Gemma 2 2b It', 'gemma2.attention.key_length': '256', 'gemma2.attention.value_length': '256', 'gemma2.attention.layer_norm_rms_epsilon': '0.000001', 'general.finetune': 'jpn-it', 'general.file_type': '15', 'gemma2.attention.sliding_window': '4096', 'gemma2.final_logit_softcapping': '30.000000', 'tokenizer.ggml.model': 'llama', 'general.name': 'Gemma 2 2b Jpn It', 'tokenizer.ggml.bos_token_id': '2', 'tokenizer.ggml.eos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '3'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Using chat eos_token: <eos>
Using chat bos_token: <bos>
llama_perf_context_print:        load time =    1394.68 ms
llama_perf_context_print: prompt eval time =    1394.48 ms /     3 tokens (  464.83 ms per token,     2.15 tokens per second)
llama_perf_context_print:        eval time =    1267.61 ms /     2 runs   (  633.81 ms per token,     1.58 tokens per second)
llama_perf_context_print:       total time =    2672.58 ms /     5 tokens
llama_perf_context_print:    graphs reused =          1
{'id': 'cmpl-b95b0d16-5f87-458a-8715-d90b48e6d4ac', 'object': 'text_completion', 'created': 1771488001, 'model': './gemma-2-2b-jpn-it-Q4_K_M.gguf', 'choices': [{'text': 'こんにちは、どうも!', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 3, 'completion_tokens': 3, 'total_tokens': 6}}

Something came back, no problem.

Let's tweak the code a little and measure the elapsed time.

from llama_cpp import Llama
import time

# Start timing
start_time = time.time()

# Path to the downloaded model
llm = Llama(model_path="./gemma-2-2b-jpn-it-Q4_K_M.gguf")
output = llm("こんにちは、", max_tokens=32, stop=["\n"], echo=True)

# Stop timing
end_time = time.time()

# Print the response and the elapsed time
print(output['choices'][0]['text'])
print(f"Elapsed time: {end_time - start_time:.2f} seconds")

The result:

こんにちは、どうも!
Elapsed time: 3.70 seconds

Whoa, that takes a fair amount of time.
How does it look in your environment?

For reference, on a machine with a 3rd-generation Intel i7-3720QM and 16 GB of RAM, it took 1.54 seconds.
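
Incidentally, the log above shows that the GGUF ships with a chat template, and llama-cpp-python can apply it for you via create_chat_completion. On a machine this slow it may also be worth pinning the context size and thread count explicitly. A hedged sketch; the n_ctx and n_threads values below are guesses for this hardware, not tuned recommendations:

# chat_test.py: use the built-in chat template and explicit low-spec settings
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-2-2b-jpn-it-Q4_K_M.gguf",
    n_ctx=512,      # keep the KV cache small
    n_threads=2,    # match the two physical cores of the i5-460M
    verbose=False,
)

# create_chat_completion applies the chat template stored in the GGUF
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "こんにちは!"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])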
