
Setting Up NVIDIA Tesla P40 for AI Inference

Learn how to configure the Tesla P40 GPU for running local LLMs and AI models.


The Tesla P40 is a fantastic budget option for running local LLMs. With 24GB of VRAM and decent compute performance, it can handle models like Llama 2 70B (quantized) or Mixtral 8x7B, with any layers that don't fit in VRAM spilling over to the CPU and system RAM.

Why the Tesla P40?

  • 24GB VRAM: Enough for large quantized models
  • Price: $200-300 on the secondary market
  • Power: 250W TDP, manageable for homelab
  • ECC Memory: Reliable for long inference runs

Hardware Requirements

Before starting:

  1. A server or workstation with a free PCIe x16 slot (see the quick detection check after this list)
  2. Adequate power supply with at least 250W of headroom (note: the P40 uses an 8-pin EPS/CPU power connector rather than a standard PCIe connector, so an adapter may be required)
  3. Good airflow or a GPU cooling solution (P40 is passive!)
  4. Ubuntu 22.04 or similar Linux distro
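
Before moving on, it's worth confirming that the card is actually detected on the PCIe bus. A quick check (assuming a stock Ubuntu install, where lspci is available out of the box) looks like this:

# Confirm the Tesla P40 shows up on the PCIe bus (works even before drivers are installed)
lspci | grep -i nvidia

# You should see an NVIDIA entry for the Tesla P40 in the output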

Step 1: Cooling Solution

Important: The Tesla P40 is passively cooled and requires significant airflow. Options:

  1. Rack server: Use proper rack fans
  2. Desktop case: Add a 92mm fan with a 3D-printed shroud
  3. Aftermarket cooler: Gelid ICY Vision or similar

Step 2: Install NVIDIA Drivers

# Add NVIDIA repository
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update

# Install the driver
sudo apt install nvidia-driver-535

# Reboot
sudo reboot

# Verify installation
nvidia-smi
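
If the driver installed correctly, nvidia-smi should list the Tesla P40 with its full 24GB of memory. For a more targeted sanity check, you can query just the fields you care about:

# Show only the GPU name, total memory, and driver version
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv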

Step 3: Install CUDA Toolkit

Download and install CUDA:

wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run --toolkit --silent

Then add CUDA to your environment (add these lines to ~/.bashrc):

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Finally, run source ~/.bashrc (or open a new shell) to apply the changes.
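
To confirm the toolkit is actually on your PATH, check the compiler version; the exact version string will depend on the installer you used:

# Verify the CUDA toolkit is installed and on your PATH
nvcc --version

# The runtime libraries should now be under /usr/local/cuda/lib64
ls /usr/local/cuda/lib64/libcudart*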

Step 4: Set Up Ollama

Ollama makes running LLMs incredibly easy:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start the service
sudo systemctl enable ollama
sudo systemctl start ollama

# Pull a model
ollama pull llama2:70b-chat-q4_K_M
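
Once a model is pulled, it's worth verifying that Ollama is offloading to the P40 rather than silently falling back to the CPU. One simple way (assuming the ollama systemd service name created by the install script) is to watch the service logs and VRAM usage:

# Watch Ollama's logs for GPU/CUDA detection messages
journalctl -u ollama -f

# In another terminal, confirm VRAM usage climbs once a model is loaded (refreshes every 2 seconds)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2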

Step 5: Test Your Setup

# Run an interactive session
ollama run llama2:70b-chat-q4_K_M

Or use the API:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama2:70b-chat-q4_K_M", "prompt": "Hello"}'
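
By default, the generate endpoint streams one JSON object per token. For scripting, it's often easier to request a single response; the sketch below uses Ollama's stream flag and assumes jq is installed for extracting the text:

# Ask for a single (non-streamed) JSON response and pull out the generated text
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2:70b-chat-q4_K_M", "prompt": "Hello", "stream": false}' \
  | jq -r '.response'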

Performance Tips

  1. Quantization: Use Q4_K_M or Q5_K_M for the best speed/quality balance
  2. Context Length: Keep context reasonable (4096-8192 tokens)
  3. Batch Size: Increase for throughput, decrease for latency
  4. Temperature Control: Monitor GPU temps and aim to stay under 85°C (see the monitoring snippet below)
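
A simple way to keep an eye on tip 4 is to poll nvidia-smi while a model is running. The power-limit commands at the end are optional and assume the driver allows lowering the limit on the P40:

# Poll temperature, power draw, utilization, and VRAM every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,power.draw,utilization.gpu,memory.used --format=csv -l 5

# Optional: cap power draw to reduce heat (enable persistence mode first)
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 180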

Conclusion

The Tesla P40 is an excellent choice for budget AI inference. While it lacks the speed of newer GPUs, its 24GB VRAM makes it viable for running large models at home.

Happy inferencing! 🤖