Freedom GPT supports any model available in GGUF format. That covers Llama, Mistral, Gemma, and DeepSeek, along with the more than 40 model architectures that llama.cpp supported as of 2024. Model choice determines response quality, speed, and memory requirements.
Llama Models
Meta's Llama model family is the most widely used in Freedom GPT. Llama 3 (8B and 70B) and Llama 2 (7B, 13B, 70B) are available as GGUF downloads from Hugging Face. Llama 3 8B at Q4 quantization requires approximately 5 to 6 GB of RAM and delivers strong general-purpose performance for its size.
Mistral Models
Mistral 7B and Mixtral 8x7B are compatible with Freedom GPT via GGUF format. Mistral 7B at Q4 quantization requires approximately 4 to 5 GB of RAM. Mixtral 8x7B is a mixture-of-experts architecture: only some experts are active for each token, but all of them must be loaded into memory. It therefore requires approximately 26 to 30 GB at Q4 quantization, making it suitable only for systems with 32 GB or more of RAM.
Gemma Models
Google DeepMind's Gemma 2 (2B, 9B, 27B) is supported in GGUF format. Gemma 2 2B at Q4 quantization requires approximately 1.5 to 2 GB of RAM, making it the most accessible option for systems with limited memory. Gemma 2 9B requires approximately 6 GB at Q4 quantization.
DeepSeek Models
DeepSeek's 7B and 14B models are available in GGUF format and run in Freedom GPT. DeepSeek-R1 distilled variants (7B, 14B) are compatible and offer strong reasoning performance relative to their size. The full DeepSeek-R1 671B model requires data center hardware and does not run on consumer systems.
Other Compatible Open-Source LLMs
Any model architecture supported by llama.cpp in GGUF format is compatible with Freedom GPT. This includes Qwen 2.5, Phi-3, Falcon, MPT, and community fine-tuned variants of major model families available on Hugging Face.
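Because compatibility depends on the file actually being GGUF, it can be worth checking a download before loading it. The following is a minimal sketch, not part of Freedom GPT: it assumes the little-endian header layout used by GGUF version 2 and later (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts), and the helper name `read_gguf_header` is hypothetical.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count), or None if not GGUF.

    Assumes the little-endian GGUF v2+ layout; older or big-endian files
    would need different handling.
    """
    with Path(path).open("rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        return None
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

A return value of `None` suggests the file is in another format (for example, raw safetensors weights) and would need conversion before use.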
Which Models Perform Best Inside Freedom GPT
Llama 3 8B and Mistral 7B deliver the best balance of response quality and hardware efficiency for systems with 8 to 16 GB RAM. For systems with 64 GB or more of RAM, Llama 3 70B at Q4 quantization, which needs approximately 38 to 45 GB, provides significantly higher output quality for complex reasoning tasks.
Model Size vs Performance Tradeoffs
7B Models
A 7B parameter model at Q4 quantization requires approximately 4 to 5 GB of RAM. Token generation speed on a modern CPU averages 10 to 25 tokens per second depending on processor speed and core count. GPU offloading increases this to 30 to 80 tokens per second with an NVIDIA RTX 3060 or higher.
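To put these token rates in perspective, the arithmetic below converts them into waiting time for a single reply. It is a rough sketch: the 300-token reply length is an assumed figure, and the calculation ignores prompt processing time, which adds to the total.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Seconds needed to generate num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 300  # assumed length of one chat reply

# Speed ranges quoted above for a 7B model at Q4 quantization
for label, low, high in [("CPU only", 10, 25), ("GPU offload", 30, 80)]:
    slowest = generation_seconds(RESPONSE_TOKENS, low)
    fastest = generation_seconds(RESPONSE_TOKENS, high)
    print(f"{label}: {fastest:.0f} to {slowest:.0f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions a reply takes roughly 12 to 30 seconds on CPU and about 4 to 10 seconds with GPU offloading.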
13B Models
A 13B parameter model at Q4 quantization requires approximately 8 to 10 GB of RAM. CPU-only inference runs at 5 to 15 tokens per second on a modern 8-core processor. This size is the practical upper limit for systems with 16 GB total RAM.
30B to 34B Models
A 30B to 34B parameter model at Q4 quantization requires approximately 18 to 22 GB of RAM. Systems with 32 GB of RAM can run these models on CPU, but only at 2 to 6 tokens per second. A GPU with 24 GB of VRAM (such as an RTX 3090 or 4090) enables faster inference at 15 to 30 tokens per second.
70B+ Models
A 70B parameter model at Q4 quantization requires approximately 38 to 45 GB of RAM. That exceeds the 24 GB of VRAM on a consumer card such as an RTX 3090 or 4090, so running a 70B model entirely on GPU requires a dual-GPU configuration. CPU-only inference on a 64 GB RAM system produces 1 to 3 tokens per second, which is usable but slow for interactive chat.
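The RAM figures in this section follow from simple arithmetic: parameter count times bits per weight, plus some working memory. The sketch below is an approximation under two assumptions that are not from the article: an average of 4.5 bits per weight for Q4 (the exact value varies between Q4 variants) and a flat 1 GB allowance for context and buffers.

```python
def estimate_ram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.0):
    """Rough RAM estimate in GB for a quantized model.

    bits_per_weight and overhead_gb are illustrative assumptions;
    real usage depends on the quantization variant and context length.
    """
    # billions of params * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (7, 13, 34, 70):
    print(f"{size}B: about {estimate_ram_gb(size):.1f} GB")
```

With these defaults the estimates fall inside the ranges quoted above for each size class. Longer context windows or higher-precision quantization would push the numbers up.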
Choosing the Right Model for Your Hardware
| System RAM | GPU VRAM | Recommended Model Size | Expected Speed |
| --- | --- | --- | --- |
| 8 GB | None | Up to 7B (Q4) | 5 to 15 tokens/sec (CPU) |
| 16 GB | None | Up to 13B (Q4) | 5 to 12 tokens/sec (CPU) |
| 16 GB | 8 GB VRAM | Up to 7B (full GPU) | 30 to 60 tokens/sec |
| 32 GB | 12 to 16 GB VRAM | Up to 13B (full GPU) or 30B (split) | 20 to 50 tokens/sec |
| 64 GB | 24 GB VRAM | Up to 34B (full GPU) or 70B (CPU+GPU split) | 15 to 40 tokens/sec |
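The table above can also be expressed as a small lookup. This is an illustrative sketch only: treating each row as a minimum RAM and VRAM threshold is an assumption, and `recommend_model` is a hypothetical helper, not a Freedom GPT function.

```python
# Rows mirror the table: (minimum RAM in GB, minimum VRAM in GB, recommendation).
# Ordered from most to least capable so the first matching row wins.
HARDWARE_TIERS = [
    (64, 24, "Up to 34B (full GPU) or 70B (CPU+GPU split)"),
    (32, 12, "Up to 13B (full GPU) or 30B (split)"),
    (16, 8, "Up to 7B (full GPU)"),
    (16, 0, "Up to 13B (Q4)"),
    (8, 0, "Up to 7B (Q4)"),
]


def recommend_model(ram_gb, vram_gb=0):
    """Return the table's recommendation, or None if below the smallest tier."""
    for min_ram, min_vram, recommendation in HARDWARE_TIERS:
        if ram_gb >= min_ram and vram_gb >= min_vram:
            return recommendation
    return None


print(recommend_model(16))      # 16 GB RAM, no GPU
print(recommend_model(32, 16))  # 32 GB RAM, 16 GB VRAM
```

Hardware that falls between rows, such as 32 GB of RAM with no GPU, drops to the nearest lower tier under this scheme.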