KaihuaTang / Qwen-Tokenizer-Pruner
Due to the huge vocabulary size (151,936) of Qwen models, the Embedding and LM Head weights are excessively heavy. Therefore, this project provides a tokenizer vocabulary pruning solution for Qwen and Qwen-VL.
☆20 · Updated 8 months ago
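The idea behind this kind of pruning: in a decoder-only LM, the input embedding table and the LM head are both [vocab_size, hidden] matrices, so every vocabulary entry dropped removes two rows of weights. Below is a minimal PyTorch sketch of that slicing step; it is illustrative only, and `prune_vocab` and its signature are assumptions, not this repository's actual API.

```python
import torch

def prune_vocab(embed_weight: torch.Tensor,
                lm_head_weight: torch.Tensor,
                keep_ids: list[int]):
    """Slice the embedding and LM-head matrices down to a kept sub-vocabulary.

    embed_weight:   [vocab_size, hidden] input embedding table
    lm_head_weight: [vocab_size, hidden] output projection (untied weights)
    keep_ids:       old token ids to keep, listed in their new-id order
    """
    idx = torch.tensor(keep_ids, dtype=torch.long)
    new_embed = embed_weight[idx].clone()      # [len(keep_ids), hidden]
    new_lm_head = lm_head_weight[idx].clone()
    # Old-id -> new-id map; a pruned tokenizer must emit the new ids.
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    return new_embed, new_lm_head, old_to_new
```

For Qwen's 151,936-entry vocabulary, keeping only the tokens that cover a target corpus (say, a few tens of thousands) removes over a hundred thousand rows from each of the two matrices; the harder part is rebuilding a consistent tokenizer around the surviving entries, which is what a project like this addresses.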
Alternatives and similar repositories for Qwen-Tokenizer-Pruner:
Users interested in Qwen-Tokenizer-Pruner are comparing it to the libraries listed below
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ☆47 · Updated 2 months ago
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs ☆133 · Updated last month
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆48 · Updated last year
- This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context" ☆33 · Updated 9 months ago
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023) ☆36 · Updated last year
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆42 · Updated 5 months ago
- Multi-Candidate Speculative Decoding ☆35 · Updated last year
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆121 · Updated 3 months ago
- CoT-Valve: Length-Compressible Chain-of-Thought Tuning ☆65 · Updated 2 months ago
- Repo for the EMNLP'24 paper "Dual-Space Knowledge Distillation for Large Language Models". A general white-box KD framework for both same… ☆47 · Updated 5 months ago
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models ☆43 · Updated 5 months ago
- Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact" ☆43 · Updated 11 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆53 · Updated 10 months ago
- Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding** ☆183 · Updated 2 months ago
- This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models" ☆48 · Updated 9 months ago
- Model merging is a highly efficient approach for long-to-short reasoning.