FASRC Deployment

  1. Log in to a FASRC login node.

  2. Allocate GPU resources:

    alloc -c 32 -m 1024 -g 2 -u h200 -t 168 -p seas_gpu
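    # flag meanings (an assumption; alloc appears to be a local wrapper around salloc):
    #   -c 32   CPU cores              -m 1024      memory
    #   -g 2    GPUs                   -u h200      GPU type
    #   -t 168  time limit (hours)     -p seas_gpu  partition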

    You will then be given a GPU node name; SSH to that node.
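
    If the node name isn't shown directly, a sketch for looking it up (assumes
    standard Slurm; the hostname below is only an example):

    # find the allocated node in the NODELIST column
    squeue -u $USER -o "%.10i %.12P %.8T %R"
    # then ssh to it, e.g.
    ssh holygpu8a17402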

  3. SSH to the GPU node and serve the model with vLLM:

    # container deploy on a holygpu node (via podman)
    podman run --rm \
        --device nvidia.com/gpu=all \
        -p 8000:8000 \
        --ipc=host \
        -v /n/netscratch/juncheng_lab/muxint/llm_models:/models:Z \
        docker.io/vllm/vllm-openai:latest \
        --model /models/meta-llama_Llama-4-Scout-17B-16E \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 2 \
        --gpu-memory-utilization 0.95
    
    # Qwen3 Coder
    podman run --rm \
        --device nvidia.com/gpu=all \
        -p 8000:8000 \
        --ipc=host \
        -v /n/netscratch/juncheng_lab/muxint/llm_models:/models:Z \
        docker.io/vllm/vllm-openai:latest \
        --model /models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8 \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.95
    
    • Note: change the port and model path to match your deployment, and set --tensor-parallel-size to the number of allocated GPUs.
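
    • Before opening the tunnel, you can sanity-check the server locally on the
      GPU node (vLLM exposes the standard OpenAI-compatible endpoints):

    curl http://localhost:8000/v1/models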

  4. Create the reverse SSH tunnel:

    ssh -N -f -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=60 -R 0.0.0.0:8000:localhost:8000 murphy@freeinference.org
    # change the forwarded port as needed
    
    # run this monitor script in the background to keep the tunnel up
    nohup ./tunnel_monitor.sh > /dev/null 2>&1 &
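
    tunnel_monitor.sh is not reproduced here; a minimal sketch of what such a
    watchdog could look like, assuming the same port and host as above:

    #!/usr/bin/env bash
    # re-open the reverse tunnel whenever it is no longer running
    while true; do
        if ! pgrep -f "ssh -N.*-R 0.0.0.0:8000" > /dev/null; then
            ssh -N -f -o ExitOnForwardFailure=yes \
                -o ServerAliveInterval=30 -o ServerAliveCountMax=60 \
                -R 0.0.0.0:8000:localhost:8000 murphy@freeinference.org
        fi
        sleep 60
    done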
    
  5. Check that the reverse proxy works:

    curl http://freeinference.org:8001/v1/models
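
    If that lists the model, a full chat-completion request through the tunnel
    looks like the following (the model name is the path that was passed to
    --model, and the host/port must match your tunnel):

    curl http://freeinference.org:8001/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "/models/meta-llama_Llama-4-Scout-17B-16E",
              "messages": [{"role": "user", "content": "Say hello."}],
              "max_tokens": 32
            }'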
    

Docker images

# save docker images
podman save docker.io/vllm/vllm-openai:latest -o /n/netscratch/juncheng_lab/muxint/vllm-openai-latest.tar

# load docker images
podman load -i /n/netscratch/juncheng_lab/muxint/vllm-openai-latest.tar

# list existing images
podman images
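
# note (assuming the usual pull -> save -> load workflow): podman save only
# exports an image that is already present locally, so pull it first on a node
# with outbound network access
podman pull docker.io/vllm/vllm-openai:latest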