ds4.c is a narrow Metal inference engine for DeepSeek V4 Flash. It is not a general GGUF runner: it supports only the GGUF files published in the antirez/deepseek-v4-gguf model repo.
Served Model IDs
| Model ID | Behavior |
|---|---|
| deepseek-v4-flash | Primary model ID. |
| deepseek-chat | Alias that disables thinking for direct answers. |
| deepseek-reasoner | Alias that enables thinking. |
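As a sketch of how the aliases above are selected in practice, the snippet below builds an OpenAI-style chat-completions request body for one of the served IDs. The `chat_request` helper and the prompt are hypothetical; only the three model IDs come from the table.

```python
import json

# The three model IDs served by ds4.c (from the table above). Everything
# else in this sketch is a standard chat/completions payload shape.
SERVED_IDS = {"deepseek-v4-flash", "deepseek-chat", "deepseek-reasoner"}

def chat_request(model_id: str, prompt: str) -> str:
    """Build a JSON body for POST /v1/chat/completions (hypothetical helper)."""
    if model_id not in SERVED_IDS:
        raise ValueError(f"unknown model id: {model_id}")
    return json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    })

# Selecting "deepseek-reasoner" picks the thinking-enabled alias of the
# same underlying weights; "deepseek-chat" would disable thinking.
body = chat_request("deepseek-reasoner", "Why is the sky blue?")
```

Since the aliases map to one set of weights, switching behavior is just a matter of changing the `model` string in the request.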
Downloadable GGUF Files
| Variant | Purpose | File | Approx. size | Intended machine |
|---|---|---|---|---|
| q2 | Main model, lower memory | DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 86.7 GB | 128 GB RAM Macs |
| q4 | Main model, larger/higher quality | DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 165 GB | 256 GB+ RAM Macs |
| mtp | Optional speculative decoding support | DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.81 GB | Optional with either q2 or q4 |
System Requirements and Caveats
| Area | Requirement or caveat |
|---|---|
| Production backend | Metal-only. |
| Target hardware | High-end Macs / Mac Studios with large unified memory. |
| Minimum practical memory | 128 GB for the q2 model. |
| q4 memory | At least 256 GB RAM. |
| Maximum context | The model supports a 1M-token context window. |
| Full-context memory overhead | Around 26 GB of extra memory for the full 1M context; around 22 GB with the compressed indexer. |
| Practical 128 GB context | Around 100k to 300k tokens is wiser than the full 1M context. |
| CPU path | Reference/debug only, not production. The README warns that current macOS CPU execution can crash the kernel. |
| Build | `make` |
| Download commands | `./download_model.sh q2`, `./download_model.sh q4`, or `./download_model.sh mtp` |
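To see why 100k-300k tokens is the practical ceiling on a 128 GB machine, here is a back-of-the-envelope sketch using the figures above. It assumes context overhead scales roughly linearly with token count, which is an approximation, not something the README states.

```python
# Figures quoted in the tables above (all in GB).
Q2_MODEL_GB = 86.7            # q2 GGUF file size
FULL_CONTEXT_GB = 26.0        # extra memory at the full 1M-token context
FULL_CONTEXT_TOKENS = 1_000_000

def context_overhead_gb(tokens: int) -> float:
    # Assumption: linear scaling of the 1M-context figure. The real
    # per-token cost may not be exactly linear.
    return FULL_CONTEXT_GB * tokens / FULL_CONTEXT_TOKENS

# q2 plus the full 1M context would use roughly 112.7 GB, leaving very
# little headroom for the OS and the rest of the runtime on a 128 GB Mac:
total_full_gb = Q2_MODEL_GB + context_overhead_gb(FULL_CONTEXT_TOKENS)

# A 300k-token context keeps the total near 94.5 GB, which is why the
# README recommends staying in the 100k-300k range:
total_300k_gb = Q2_MODEL_GB + context_overhead_gb(300_000)
```

The same arithmetic explains the q4 recommendation: a 165 GB model plus any meaningful context simply does not fit below 256 GB of unified memory.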
Server API
| Endpoint | Purpose |
|---|---|
| GET /v1/models | List available served models. |
| GET /v1/models/deepseek-v4-flash | Fetch metadata for the primary model. |
| POST /v1/chat/completions | OpenAI-compatible chat completions. |
| POST /v1/completions | OpenAI-compatible completions. |
| POST /v1/messages | Anthropic-compatible messages endpoint. |
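Since the server exposes both OpenAI- and Anthropic-compatible endpoints, the same prompt is wrapped differently depending on which one you call. The sketch below contrasts the two request bodies; the prompt text and the `max_tokens` value are arbitrary examples, and the shapes follow the public OpenAI chat/completions and Anthropic Messages formats rather than anything ds4.c-specific.

```python
import json

PROMPT = "Summarize the ds4.c README in one sentence."

# Body for POST /v1/chat/completions (OpenAI-style).
openai_body = {
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": PROMPT}],
}

# Body for POST /v1/messages (Anthropic-style). The Anthropic Messages
# API requires an explicit max_tokens field; 256 here is just an example.
anthropic_body = {
    "model": "deepseek-v4-flash",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": PROMPT}],
}

print(json.dumps(openai_body))
print(json.dumps(anthropic_body))
```

Both formats share the `model` and `messages` fields, so pointing an existing OpenAI or Anthropic client at the corresponding endpoint should mostly be a matter of changing the base URL.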