Bug: Load time on rpc server with multiple machines #9820

Open
angelosathanasiadis opened this issue Oct 10, 2024 · 1 comment
Labels
bug-unconfirmed, medium severity, stale

Comments

@angelosathanasiadis

What happened?

I have managed to run the RPC server on two different machines running Ubuntu (with different IPs) using the following commands:

1st machine: bin/rpc-server -H MY_PUBLIC_IP -p 50052
2nd machine: bin/llama-cli -m ../tinydolphin-2.8.2-1.1b-laser.Q4_K_M.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 6 --rpc MY_PUBLIC_IP:50052 -ngl 99

I have noticed that the load time is huge (compared to running the model locally through the RPC server, where it is only about 600 ms):
llama_perf_sampler_print: sampling time = 0,14 ms / 12 runs ( 0,01 ms per token, 82758,62 tokens per second)
llama_perf_context_print: load time = 55658,27 ms
llama_perf_context_print: prompt eval time = 426,00 ms / 6 tokens ( 71,00 ms per token, 14,08 tokens per second)
llama_perf_context_print: eval time = 997,43 ms / 5 runs ( 199,49 ms per token, 5,01 tokens per second)
llama_perf_context_print: total time = 1424,04 ms / 11 tokens

My question is: what exactly happens during the load time?
Assuming the model file exists on all machines, is it possible to load the model locally instead of transferring it over the network?

Name and Version

version: 3789 (d39e267)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

No response

Relevant log output

No response

@angelosathanasiadis added the bug-unconfirmed and medium severity labels on Oct 10, 2024
@rgerganov
Collaborator

My question is: what exactly happens during the load time?

Model layers are being transferred to the RPC server.
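
For rough intuition (the numbers below are assumptions, not measurements): if the Q4_K_M file is around 0.7 GB on disk and the effective link between the two machines is around 100 Mbit/s, the network transfer alone is already on the order of the observed load time. A minimal back-of-envelope sketch in Python:

# Back-of-envelope check; both values are assumptions, not measured.
model_bytes = 0.7e9            # assumed on-disk size of the Q4_K_M file (~0.7 GB)
link_bits_per_second = 100e6   # assumed effective network throughput (~100 Mbit/s)
transfer_seconds = model_bytes * 8 / link_bits_per_second
print(f"estimated transfer time: {transfer_seconds:.0f} s")   # ~56 s, same order as the reported ~55.7 s load time

When the server runs on the same host (or over a much faster link), this transfer term largely disappears, which is consistent with the ~600 ms load time reported above.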

Assuming the model file exists on all machines, is it possible to load the model locally instead of transferring it over the network?

This has been requested several times; @slaren put some ideas here

@github-actions bot added the stale label on Nov 10, 2024