Multi-node GPU Ray serving

Turns out, serving a model across N different machines is super simple using Ray. Just make sure Ray is already installed on every machine, then run one machine as the head and the other machines as workers.

For the head node,

```bash
ray start --head --port=6379 --dashboard-host=0.0.0.0
```

For each worker node,

```bash
ray start --address=HEAD_NODE_IP:6379
```
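
Once the workers have joined, you can check that every node and its GPUs are registered by running this on any machine in the cluster:

```bash
ray status
```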

After that you can just define a deployment with the `@serve.deployment` decorator and a custom `__init__`,

```python
import requests
from starlette.requests import Request
from typing import Dict
from transformers import pipeline 
from ray import serve 

# 1: Wrap the pretrained translation model in a Serve deployment.
@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 1},
)
class TranslationDeployment:
    def __init__(self):
        self._model = pipeline("translation", model="google/flan-t5-large", device="cuda")

    def __call__(self, request: Request) -> Dict:
        return self._model(request.query_params["text"])[0]

# 2: Deploy the deployment.
serve.run(TranslationDeployment.bind(), route_prefix="/")

# 3: Query the deployment and print the result.
print(
    requests.get(
        "http://localhost:8000/", params={"text": "Ray Serve is great!"}
    ).json()
)
```

1. The head node will pickle the deployment object and send it to each replica.

2. To have multiple replicas, pass the `num_replicas` argument to the `@serve.deployment` decorator.

3. The replicas can be placed on the head node itself as well as on the workers.

4. The code can be executed from anywhere, not necessarily on the head node, but by default Ray Serve connects to `localhost:6379` (see the sketch after this list).

5. The head node needs to run on a Linux-based operating system.

6. The deployment needs the `ray_actor_options` argument so Ray can also utilise the GPU on the head node for replicas placed there.
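
For point 4, here is a minimal sketch of running the deployment from a node other than the head, assuming `TranslationDeployment` from the code above is importable there; the addresses are placeholders:

```python
import ray
from ray import serve

# "auto" joins the existing cluster if this node has already run `ray start`;
# from a machine outside the cluster you could instead use the Ray Client
# address, e.g. "ray://HEAD_NODE_IP:10001".
ray.init(address="auto")

# Assumes TranslationDeployment is defined or imported as in the code above.
serve.run(TranslationDeployment.bind(), route_prefix="/")
```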

How to make it better?

If you look at the source code, we are serving an encoder-decoder model, Flan-T5 Large.

In an encoder-decoder model, the causal (autoregressive) part happens on the decoder side, and causal inference is just a continuous loop until it reaches the max length or an EOS token.
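
To make that concrete, here is a minimal sketch of that loop for Flan-T5 with greedy decoding and no KV cache (in practice `model.generate` handles all of this): the encoder runs once, then the decoder is called once per generated token until EOS or the maximum length.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").eval()

inputs = tokenizer("translate English to German: Ray Serve is great!", return_tensors="pt")

with torch.no_grad():
    # The encoder runs once per request.
    encoder_outputs = model.get_encoder()(**inputs)

    # The decoder is the causal loop: one forward pass per new token,
    # until EOS or the maximum number of new tokens.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(64):
        logits = model(
            encoder_outputs=encoder_outputs,
            attention_mask=inputs["attention_mask"],
            decoder_input_ids=decoder_input_ids,
        ).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```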

Serving like this is not efficient for concurrency, because GPUs are designed to perform the same operation in a batched manner, and too much switching happens between different requests.

So what we need to do is micro-batch the requests, but the micro-batching must happen inside the causal loop; this is called continuous batching.
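
Here is a toy sketch of the idea, with no real model, just to show the scheduling pattern: the engine keeps one batch of active sequences, every iteration of the causal loop decodes the whole batch in a single step, finished sequences leave immediately, and newly arrived requests join the next step instead of waiting for the current batch to finish. `fake_decode_step` is a stand-in for a real batched decoder forward pass.

```python
import asyncio
from dataclasses import dataclass, field

EOS = "<eos>"

@dataclass
class Sequence:
    prompt: str
    tokens: list = field(default_factory=list)
    done: asyncio.Event = field(default_factory=asyncio.Event)

def fake_decode_step(batch):
    # Stand-in for one batched decoder forward pass: one token per active
    # sequence, finishing each sequence after four tokens.
    return [f"tok{len(seq.tokens)}" if len(seq.tokens) < 4 else EOS for seq in batch]

async def engine_loop(request_queue):
    active = []
    while True:
        # Newly arrived requests join the running batch immediately.
        while not request_queue.empty():
            active.append(request_queue.get_nowait())
        if not active:
            await asyncio.sleep(0.01)
            continue
        # One causal step for the whole batch at once.
        for seq, tok in zip(active, fake_decode_step(active)):
            seq.tokens.append(tok)
        # Finished sequences leave the batch without blocking the others.
        for seq in [s for s in active if s.tokens[-1] == EOS]:
            seq.done.set()
        active = [s for s in active if s.tokens[-1] != EOS]
        await asyncio.sleep(0)  # yield so new requests can be enqueued

async def generate(request_queue, prompt):
    seq = Sequence(prompt)
    await request_queue.put(seq)
    await seq.done.wait()
    return seq.tokens

async def main():
    request_queue = asyncio.Queue()
    engine = asyncio.create_task(engine_loop(request_queue))
    results = await asyncio.gather(*(generate(request_queue, f"request {i}") for i in range(3)))
    print(results)
    engine.cancel()

asyncio.run(main())
```

This is the same pattern that continuous-batching inference engines implement on the GPU: the batch composition changes every decoding step, so the GPU stays busy with whatever requests are in flight.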