Bug: Load time on rpc server with multiple machines #9820
Labels
bug-unconfirmed
medium severity
stale
What happened?
I have managed to run the RPC server on 2 different machines running Ubuntu (with different IPs) with the following commands:

1st machine:
```shell
bin/rpc-server -H MY_PUBLIC_IP -p 50052
```
2nd machine:
```shell
bin/llama-cli -m ../tinydolphin-2.8.2-1.1b-laser.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 6 --rpc MY_PUBLIC_IP:50052 -ngl 99
```
I have noticed that the load time is huge compared to running the model locally with the RPC server, where it is only 600 ms:
```
llama_perf_sampler_print: sampling time  =     0,14 ms / 12 runs  (  0,01 ms per token, 82758,62 tokens per second)
llama_perf_context_print: load time      = 55658,27 ms
llama_perf_context_print: prompt eval time =  426,00 ms /  6 tokens ( 71,00 ms per token, 14,08 tokens per second)
llama_perf_context_print: eval time      =   997,43 ms /  5 runs   (199,49 ms per token,  5,01 tokens per second)
llama_perf_context_print: total time     =  1424,04 ms / 11 tokens
```
My question is: what exactly happens during the load time?
If the model already exists on all machines, is there a way to load it locally on each machine instead of sending it over the network?
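For what it's worth, the numbers are consistent with the load time being dominated by transferring the model weights over the network to the remote backend. A back-of-envelope check (the model size of ~0.67 GB for a 1.1B Q4_K_M file and a ~100 Mbit/s link are both assumptions, not measured values):

```python
# Rough estimate: if load time ≈ time to ship the weights to the rpc-server,
# then transfer time ≈ model size / link bandwidth.
# Assumed (hypothetical) numbers:
model_bytes = 0.67e9        # tinydolphin 1.1B Q4_K_M, roughly 0.67 GB
link_bits_per_s = 100e6     # a 100 Mbit/s network link

seconds = model_bytes * 8 / link_bits_per_s
print(f"estimated transfer time: {seconds:.1f} s")
```

Under these assumptions the estimate comes out around 54 s, the same order of magnitude as the ~55,6 s load time reported above, whereas a local load avoids the network entirely.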
Name and Version
version: 3789 (d39e267)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
No response
Relevant log output
No response