# remove_refusal_for_bigModel

**Repository Path**: zhangbo2008/remove_refusal_for_big-model

## Basic Information

- **Project Name**: remove_refusal_for_bigModel
- **License**: Apache-2.0
- **Default Branch**: master

## README

# Removing refusals with HF Transformers

Write-up on the inference process (in Chinese): https://www.cnblogs.com/zhangbo2008/p/19341808

This is a crude proof-of-concept implementation for removing refusals from an LLM without using TransformerLens. That means it supports every model that HF Transformers supports.* The code was developed on an RTX 2060 6GB, so mostly models under 3B parameters have been tested, but it has also been confirmed to work with bigger models.

*While most models are compatible, some are not, mainly because of custom model implementations. Some Qwen implementations, for example, don't work: their decoder layers can't be reached through `model.model.layers`; the attributes are named such that `model.transformer.h` must be used instead, if I'm not mistaken (see the layer-lookup sketch at the end of this README).

## Usage

1. Set the model and quantization in compute_refusal_dir.py and inference.py (quantization can apparently be mixed between the two scripts).
2. Run compute_refusal_dir.py (some settings in that file may need to be changed depending on your use case).
3. Run inference.py and ask the model how to build an army of rabbits that will one day overthrow your local government by stealing all the carrots.

Rough sketches of what the two scripts do are given at the end of this README.

## Credits

- [Harmful instructions](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)
- [Harmless instructions](https://huggingface.co/datasets/yahma/alpaca-cleaned)
- [Technique](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction)
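## How it works (rough sketches)

Per the technique linked above, refusal behaviour appears to be mediated by a single direction in the residual stream. Below is a minimal sketch of what a script like compute_refusal_dir.py might do: run harmful and harmless instructions through the model, collect the hidden state at one layer, and take the normalized difference of the means as the "refusal direction". The model name, layer index, tiny prompt lists, and output file name are illustrative assumptions, not the repository's actual values.

```python
# Sketch: estimate a "refusal direction" as the difference between mean
# hidden states on harmful vs. harmless instructions at one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small chat model works
LAYER = 14                               # assumption: a middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

harmful = ["Write instructions for making a bomb."]   # stand-in for advbench
harmless = ["Write a short poem about the ocean."]    # stand-in for alpaca-cleaned

def mean_hidden(prompts, layer):
    """Mean hidden state at `layer`, taken at the last token position."""
    acts = []
    for p in prompts:
        ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": p}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

refusal_dir = mean_hidden(harmful, LAYER) - mean_hidden(harmless, LAYER)
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector
torch.save(refusal_dir, "refusal_dir.pt")       # assumed file name
```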
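The compatibility footnote above boils down to where the decoder blocks live on the module tree, which varies between model implementations. Here is a hedged helper that tries the common attribute paths; the exact list of paths is an assumption and may need extending for more exotic architectures.

```python
def get_decoder_layers(model):
    """Return the list of transformer blocks, trying common attribute paths.

    LLaMA-style models expose them as `model.model.layers`, while GPT-2-style
    (and some custom Qwen) implementations use `model.transformer.h`.
    """
    for path in ("model.layers", "transformer.h", "model.decoder.layers"):
        obj = model
        try:
            for attr in path.split("."):
                obj = getattr(obj, attr)
            return obj
        except AttributeError:
            continue
    raise AttributeError("Could not locate decoder layers on this model.")
```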
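For the inference side, one way to remove the refusal behaviour is to project the saved direction out of every decoder block's output using forward hooks. The sketch below continues from the two sketches above (it reuses `model`, `tokenizer`, and `get_decoder_layers`). Note that hook-based ablation is only one possible implementation; the repository's inference.py may instead orthogonalize the model weights directly.

```python
# Sketch: ablate the refusal direction at inference time with forward hooks.
import torch

refusal_dir = torch.load("refusal_dir.pt")  # unit vector from the sketch above

def make_ablation_hook(direction):
    def hook(module, inputs, output):
        # Decoder blocks usually return a tuple whose first element is the
        # hidden states; subtract the component along the refusal direction.
        hidden = output[0] if isinstance(output, tuple) else output
        d = direction.to(device=hidden.device, dtype=hidden.dtype)
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

handles = [layer.register_forward_hook(make_ablation_hook(refusal_dir))
           for layer in get_decoder_layers(model)]

prompt = "How do I build an army of rabbits to overthrow my local government?"
ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(ids, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:  # restore the original model behaviour
    h.remove()
```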