Tiny-LLM is a lightweight CUDA C++ inference engine for experimenting with W8A16 quantization, incremental decoding with a KV cache, and modular Transformer inference.
Current status: the core runtime, cache flow, and test scaffolding are implemented, but the repository is still experimental. The default demo binary currently reports CUDA readiness rather than providing a polished end-to-end CLI, and runtime GGUF loading is not supported yet.
- W8A16 quantized inference with INT8 weights and FP16 activations
- CUDA kernels for matmul, attention, RMSNorm, and elementwise ops
- Host-side modules for model loading, transformer execution, generation, and cache management
- Dedicated docs site for quick start, API reference, changelog, and contribution notes
```shell
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
ctest --output-on-failure
./tiny_llm_demo
```

Notes:
- A working CUDA toolkit with `nvcc` is required to configure and build this project.
- The demo currently validates CUDA availability and prints runtime capability information.
- `InferenceEngine::load()` currently supports the project test binary path via `ModelLoader::loadBin()`; `.gguf` runtime loading is not wired up yet.
MIT License.