FASRC Deployment
Log in to a FASRC login node.
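For example, from your local machine (this assumes the standard FASRC login host; replace the placeholder with your own FASRC username):

ssh <your_fasrc_username>@login.rc.fas.harvard.edu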
Allocate GPU resources:
alloc -c 32 -m 1024 -g 2 -u h200 -t 168 -p seas_gpu

You will then get a GPU node name.
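If the node name is not shown in the allocation output, and assuming alloc is a wrapper around a Slurm allocation, a standard Slurm query will list it:

# show your running jobs and the node(s) they were placed on
squeue -u $USER -o "%.10i %.12P %.8T %N"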
SSH to the GPU node and serve the model with vLLM:
# docker deploy on holygpu node
podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --ipc=host \
  -v /n/netscratch/juncheng_lab/muxint/llm_models:/models:Z \
  docker.io/vllm/vllm-openai:latest \
  --model /models/meta-llama_Llama-4-Scout-17B-16E \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95

# qwen3 coder
podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --ipc=host \
  -v /n/netscratch/juncheng_lab/muxint/llm_models:/models:Z \
  docker.io/vllm/vllm-openai:latest \
  --model /models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95
Note that you should change the port and the model path as needed, and set --tensor-parallel-size to match the number of allocated GPUs.
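Before setting up the tunnel, you can verify the server directly on the GPU node (adjust the port if you changed it above):

# on the GPU node, once vLLM reports the server is running
curl http://localhost:8000/v1/models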
Create the reverse SSH tunnel:
ssh -N -f -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=60 \
  -R 0.0.0.0:8000:localhost:8000 murphy@freeinference.org
# change the port as needed

# running this monitor script keeps the ssh tunnel alive
nohup ./tunnel_monitor.sh > /dev/null 2>&1 &
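tunnel_monitor.sh itself is not included here; a minimal sketch, assuming its only job is to re-create the same reverse tunnel whenever it drops, could look like this:

#!/usr/bin/env bash
# tunnel_monitor.sh (sketch, assumed behavior): re-establish the reverse
# tunnel whenever no matching ssh process is found.
while true; do
  if ! pgrep -f "ssh .*-R 0.0.0.0:8000:localhost:8000 murphy@freeinference.org" > /dev/null; then
    ssh -N -f -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=60 \
      -R 0.0.0.0:8000:localhost:8000 murphy@freeinference.org
  fi
  sleep 30
done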
Check that the reverse proxy works:
curl http://freeinference.org:8001/v1/models
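Once the models endpoint responds, you can send a quick test request through the same endpoint (the model name must match what /v1/models returns; the prompt and token limit here are just examples):

curl http://freeinference.org:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/meta-llama_Llama-4-Scout-17B-16E",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'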
Docker images
# save docker images
podman save docker.io/vllm/vllm-openai:latest -o /n/netscratch/juncheng_lab/muxint/vllm-openai-latest.tar
# load docker images
podman load -i /n/netscratch/juncheng_lab/muxint/vllm-openai-latest.tar
# check existing images
podman images