llama.cpp
llama.cpp runs at native speed when compiled for CUDA architecture 86 and with cuBLAS enabled:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DGGML_CUDA_FORCE_CUBLAS=true
Compiling for multiple CUDA architectures should be fine as long as one of the architectures is 80, 86 or 89.
Compiling with cuBLAS disabled might lead to performance degradation.
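As a rough sketch of the architecture rule above, the script below checks a semicolon-separated architecture list before printing the configure command. The helper name has_supported_arch and the example list "75;86" are assumptions made for illustration; they are not part of llama.cpp or its build system. The script only prints the cmake command, it does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): succeeds if the
# semicolon-separated list in $1 contains at least one of 80, 86 or 89.
has_supported_arch() {
    # Wrap the list in semicolons so each entry is matched as a whole value.
    case ";$1;" in
        *";80;"*|*";86;"*|*";89;"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example value, assumed for illustration only.
ARCHS="75;86"

if has_supported_arch "$ARCHS"; then
    # Print the configure command with the list quoted so the shell
    # does not treat ';' as a command separator when it is pasted.
    printf '%s\n' "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"$ARCHS\" -DGGML_CUDA_FORCE_CUBLAS=true"
else
    printf '%s\n' "no supported architecture (80, 86 or 89) in: $ARCHS" >&2
fi
```

Quoting the architecture list matters because an unquoted semicolon would end the cmake command early in most shells.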
Windows
You need to install the HIP SDK to have access to rocBLAS.