# **FASRC Deployment**

1. Log in to a FASRC login node.

2. Allocate GPU resources:

`alloc -c 32 -m 1024 -g 2 -u h200 -t 168 -p seas_gpu`

You will get a GPU node name back; SSH to that node next.

3. SSH to the GPU node and serve the model with vLLM:

```
# docker deploy on holygpu node
podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --ipc=host \
  -v /n/netscratch/juncheng_lab/muxint/llm_models:/models:Z \
  docker.io/vllm/vllm-openai:latest \
  --model /models/meta-llama_Llama-4-Scout-17B-16E \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95

# qwen3 coder
podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --ipc=host \
  -v /n/netscratch/juncheng_lab/muxint/llm_models:/models:Z \
  docker.io/vllm/vllm-openai:latest \
  --model /models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95
```

- Note: change the port and the model path as needed, and set `--tensor-parallel-size` to the number of GPUs you allocated.

4. Create the SSH tunnel:

```
ssh -N -f -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=60 -R 0.0.0.0:8000:localhost:8000 murphy@freeinference.org
# change the port to match the one vLLM is serving on

# run this monitor script to keep the tunnel alive (a sketch is given at the end of this page)
nohup ./tunnel_monitor.sh > /dev/null 2>&1 &
```

5. Check that the reverse proxy works:

```
curl http://freeinference.org:8001/v1/models
```

## **Docker**

```
# save docker images
podman save docker.io/vllm/vllm-openai:latest -o /n/netscratch/juncheng_lab/muxint/vllm-openai-latest.tar

# load docker images
podman load -i /n/netscratch/juncheng_lab/muxint/vllm-openai-latest.tar

# check existing images
podman images
```
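Before creating the tunnel in step 4, it is worth confirming that vLLM is actually up on the GPU node itself. A minimal local check, assuming the default port 8000 used in the examples above:

```
# run on the GPU node; lists the model(s) the vLLM server has loaded
curl http://localhost:8000/v1/models
```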
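The `tunnel_monitor.sh` referenced in step 4 is not included here. Below is a minimal sketch of what such a monitor could look like; the retry loop, the 10-second backoff, and the use of a foreground `ssh -N` (instead of `-f`) so the loop can detect exits are all assumptions, not the actual script:

```
#!/usr/bin/env bash
# Sketch of a tunnel monitor (assumed implementation, not the real
# tunnel_monitor.sh): re-opens the reverse tunnel whenever ssh exits.
# Host, user, and port match the step 4 example; change them to yours.

REMOTE=murphy@freeinference.org
PORT=8000

while true; do
  # Foreground ssh (no -f) so the loop notices when the tunnel dies;
  # ExitOnForwardFailure makes ssh exit if the remote port can't be bound.
  ssh -N \
    -o ExitOnForwardFailure=yes \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=60 \
    -R 0.0.0.0:${PORT}:localhost:${PORT} \
    "$REMOTE"
  echo "$(date): tunnel exited, retrying in 10s" >&2
  sleep 10
done
```

An alternative with a similar effect is `autossh -M 0` with the same ssh options, if autossh is available on the node.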
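Listing models (step 5) only proves the proxy path is up; to confirm inference works end to end, send a chat completion request. A hedged example: the port and the model name must match your deployment (vLLM serves the model under the path passed to `--model` by default):

```
curl http://freeinference.org:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/meta-llama_Llama-4-Scout-17B-16E",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 32
  }'
```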