Deploying ChatGLM2-6B Locally (Step-by-Step Tutorial): Build Your Own Chinese-English LLM Chat Assistant from Scratch

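For readers who want the model running on their own machine, this guide walks from scratch through ChatGLM2-6B's hardware and VRAM requirements, environment and dependencies, weight download, an inference sanity check, and a minimal streaming chat demo, then rounds up common problems and ways to lighten the load.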

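Introduction

ChatGLM2-6B is an open-source dialogue model from Tsinghua University that performs strongly in both Chinese and English. Beyond raw capability, it is also very hardware-friendly: with 4-bit quantization it can run smoothly on a consumer GPU with only 6 GB of VRAM (such as an RTX 3060). This guide walks through the local deployment step by step and ends with a chat demo of your own.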

1. Hardware and environment setup

Before starting, confirm that your machine meets the following requirements:

GPU: at least 6 GB of VRAM (INT4 mode) or at least 14 GB (FP16 mode).
OS: Windows (WSL2 recommended) or Linux (Ubuntu).
Tools: Anaconda or Miniconda installed.
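
As a rough illustration of where these thresholds come from, the sketch below estimates the size of the weights alone at a given precision and maps a VRAM figure to a loading mode. The round figure of 6 billion parameters and the helper names `estimate_weight_gb` and `pick_mode` are assumptions made for this example; the 6 GB and 14 GB cut-offs are the ones listed above.

```python
def estimate_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Rough size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def pick_mode(vram_gb: float) -> str:
    """Map available VRAM to a loading mode using the thresholds listed above."""
    if vram_gb >= 14:
        return "fp16"
    if vram_gb >= 6:
        return "int4"
    return "cpu"


if __name__ == "__main__":
    n_params = 6e9  # assumption: "6B" taken as a round 6 billion parameters
    for bits in (16, 4):
        print(f"{bits}-bit weights: ~{estimate_weight_gb(n_params, bits):.0f} GB")
    for vram in (4, 6, 8, 16, 24):
        print(f"{vram} GB VRAM -> {pick_mode(vram)}")
```

The weights-only estimate is lower than the recommended VRAM in both modes. The difference is presumably working memory used during inference, so treat the estimate as a floor and not as a requirement.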

Install PyTorch according to your CUDA version, then install Transformers and the other core libraries:

# Install PyTorch (CUDA 11.8 as an example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you are working on a server that already has PyTorch set up, skip the previous step and move on to the next one.
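
Either way, it can be worth confirming that PyTorch actually sees the GPU before going further. The following is a minimal sketch; the helper name `describe_torch` is made up for this example, and it reports on device 0 only.

```python
def describe_torch() -> str:
    """Summarize the local PyTorch install without failing if torch is absent."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: CUDA not available (CPU only)"
    props = torch.cuda.get_device_properties(0)  # first GPU only
    vram_gib = props.total_memory / 1024**3
    return f"PyTorch {torch.__version__}: {props.name}, {vram_gib:.1f} GiB VRAM"


if __name__ == "__main__":
    print(describe_torch())
```

If the output says CUDA is not available even though the machine has an NVIDIA card, the installed wheel probably does not match the CUDA version on the machine.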

# Clone the repository and install dependencies
git clone https://github.com/THUDM/ChatGLM2-6B

cd ChatGLM2-6B
pip install -r requirements.txt
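
After the install finishes, a quick look at what actually ended up in the environment can save debugging time later. This sketch uses only the standard library; the helper name `installed_versions` is hypothetical, and the package list is just the two libraries named in this section.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "transformers"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```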

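2. Download the model weights

Because the model weight files are large (about 12 GB), users in mainland China are advised to download them from ModelScope (魔搭社区), which is faster: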

# Install modelscope
pip install modelscope

python -c "from modelscope import snapshot_download; snapshot_download('ZhipuAI/chatglm2-6b', cache_dir='./')"

The download takes a while and depends on your network; expect roughly twenty minutes.
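
An interrupted download is easy to miss. One rough check is to total the size of the downloaded folder and compare it with the 12 GB figure above. The sketch below assumes the snapshot landed in ./ZhipuAI/chatglm2-6b, which is the path the test script in the next section uses; the helper name `dir_size_gb` is made up for this example.

```python
import os


def dir_size_gb(path):
    """Total size of all regular files under path, in GB (10^9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):  # skip broken symlinks
                total += os.path.getsize(file_path)
    return total / 1e9


if __name__ == "__main__":
    model_dir = "./ZhipuAI/chatglm2-6b"  # assumption: where the snapshot was saved
    if os.path.isdir(model_dir):
        print(f"{model_dir}: {dir_size_gb(model_dir):.1f} GB on disk")
    else:
        print(f"{model_dir} not found; check where the download was saved")
```

A total far below 12 GB suggests the download did not complete and should be rerun.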

3. Test

Create a Python file and enter the following code:

from transformers import AutoTokenizer, AutoModel
import torch

# 1. Model path (for a folder under the current directory, the relative folder path is enough)
model_path = "ZhipuAI/chatglm2-6b"

# 2. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 3. Load the model (pick ONE option based on your VRAM)

# Option A: more than 13 GB of VRAM (FP16)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).cuda()

# Option B: 6-8 GB of VRAM (INT4 quantization, the most common choice)
# model = AutoModel.from_pretrained(model_path, trust_remote_code=True).quantize(4).cuda()

# Option C: no GPU (CPU mode)
# model = AutoModel.from_pretrained(model_path, trust_remote_code=True).float()

model = model.eval()

# 4. Start chatting
response, history = model.chat(tokenizer, "你好，请介绍一下你自己", history=[])
print(response)

Save it as test.py.

Then run it in a terminal:

python test.py

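If a progress bar appears, things are most likely working: it shows the weight files being loaded, which happens every time the program starts.

Once the model's reply is printed, everything is working. The reply is the answer to the prompt in the script, 你好，请介绍一下你自己 ("Hello, please introduce yourself").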
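
Instead of commenting options in and out by hand, the three loading options can be tried in order from most to least demanding. The sketch below shows the idea with stand-in loaders so that it runs without a GPU; `load_with_fallback` and `simulated_oom` are names invented for this example. It relies on CUDA out-of-memory failures surfacing as `RuntimeError` (or a subclass of it) in PyTorch.

```python
def load_with_fallback(loaders):
    """Try each (label, loader) pair in order; return (label, model) for the first success."""
    errors = []
    for label, loader in loaders:
        try:
            return label, loader()
        except RuntimeError as exc:  # CUDA OOM is raised as RuntimeError or a subclass
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all loading modes failed:\n" + "\n".join(errors))


def simulated_oom():
    """Stand-in for a loader that runs out of VRAM."""
    raise RuntimeError("CUDA out of memory (simulated)")


# With the loading lines from test.py, the list would look like this (not run here):
# loaders = [
#     ("fp16", lambda: AutoModel.from_pretrained(model_path, trust_remote_code=True).cuda()),
#     ("int4", lambda: AutoModel.from_pretrained(model_path, trust_remote_code=True).quantize(4).cuda()),
#     ("cpu",  lambda: AutoModel.from_pretrained(model_path, trust_remote_code=True).float()),
# ]

if __name__ == "__main__":
    label, model = load_with_fallback([("fp16", simulated_oom), ("int4", lambda: "int4-model")])
    print(f"loaded in {label} mode")
```

One caveat: a failed attempt may leave some GPU memory allocated, so on a tight card it can be more reliable to pick the right option up front.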

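4. Write the chat demo program

ChatGLM2 ships with its own web demo, but some servers are not assigned a public IP, and to avoid depending on a complex web interface we will write a lightweight streaming chat program for the terminal instead. Streaming lets the model print its answer in real time, like a typewriter, which makes for a much better experience.

Create a new file demo.py and paste in the following code: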

import os
import sys
import torch
import warnings
from transformers import AutoTokenizer, AutoModel, logging

# ================= 1. Environment and logging tweaks =================
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
logging.set_verbosity_error()
warnings.filterwarnings("ignore")

# Force Python's standard input/output to UTF-8 to avoid encoding errors on Windows
if sys.platform == "win32":
    import io
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
    sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

# ================= 2. Load the model =================
model_path = "ZhipuAI/chatglm2-6b"
print("Starting the AI engine, please wait...", flush=True)

try:
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True).quantize(4).cuda()
    model = model.eval()
except Exception as e:
    print(f"Failed to load the model; check your VRAM or the model path: {e}")
    sys.exit(1)

# ================= 3. Chat loop =================
def main():
    history = []
    os.system('cls' if os.name == 'nt' else 'clear')
    
    print("\n" + "="*50)
    print("ChatGLM2-6B local assistant is ready!")
    print("Tips: if the output is garbled, run 'chcp 65001' first")
    print("Type 'clear' to reset memory, 'exit' to quit")
    print("="*50 + "\n")

    while True:
        try:
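            # Compatibility: read the line straight from stdin and strip surrounding whitespace
            print("User >>> ", end="", flush=True)
            line = sys.stdin.readline()
            if not line: break  # EOF: readline() returns "" once input is closed
            query = line.strip()
            
            if not query: continue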
            if query.lower() == "exit": break
            if query.lower() == "clear":
                history = []
                os.system('cls' if os.name == 'nt' else 'clear')
                print("Memory cleared.")
                continue

            print("AI >>> ", end="", flush=True)
            
            current_length = 0
            # stream_chat hands back the updated history at each step, keeping context across turns
            for response, history in model.stream_chat(tokenizer, query, history=history):
                # Print only the newly generated part
                new_text = response[current_length:]
                print(new_text, end="", flush=True)
                current_length = len(response)
            print() 

        except UnicodeDecodeError:
            print("\n[Error] Encoding problem detected; try plain-text input, or run 'chcp 65001' before launching")
        except KeyboardInterrupt:
            print("\nChat ended.")
            break
        except Exception as e:
            print(f"\n[Runtime error]: {e}")

if __name__ == "__main__":
    main()
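
The incremental printing in the loop above can be isolated and tried without loading the model. The sketch below assumes, as the slicing in demo.py implies, that each step of `stream_chat` yields the full reply so far and not just the new piece. `fake_stream_chat` is a stand-in written for this example, not part of ChatGLM2.

```python
def fake_stream_chat(full_text, history):
    """Stand-in for model.stream_chat: yields the cumulative reply at each step."""
    for end in range(1, len(full_text) + 1):
        partial = full_text[:end]
        yield partial, history + [("hi", partial)]


def iter_deltas(stream):
    """Turn cumulative responses into just the newly added suffix at each step."""
    printed = 0
    for response, _history in stream:
        yield response[printed:]
        printed = len(response)


if __name__ == "__main__":
    for piece in iter_deltas(fake_stream_chat("Hello, I am a local assistant.", [])):
        print(piece, end="", flush=True)
    print()
```

Tracking the printed length and slicing from it is what keeps the terminal from reprinting the whole answer on every step.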

5. Run and chat

Run it in a terminal:

python demo.py

Once the banner announcing that the ChatGLM2-6B local assistant is ready appears, you can start chatting!

[Figure: sample conversation]

6. Common pitfalls and optimizations

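Out of VRAM (OOM): if you only have 6-8 GB of VRAM, you must use .quantize(4). Without a GPU, the only option is to run on the CPU (add .float()), but a single response may take several minutes.
Multi-turn chats slow down: as the number of turns grows, the history list gets longer, so every prompt is longer and uses more memory. Type clear every so often to reset it.
Tokenizer warning: if the first run prints "Migrating old cache", wait it out; the Transformers library is upgrading its cache layout.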
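
As an alternative to clearing the whole conversation by hand, the history can be capped at the most recent few turns. This is a minimal sketch that assumes `history` is a plain list with one entry per turn (for example a query/response pair); the helper name `trim_history` and the limit of 8 turns in the usage note are arbitrary choices for this example.

```python
def trim_history(history, max_turns):
    """Keep only the most recent max_turns entries of the conversation history."""
    if max_turns <= 0:
        return []  # guard: history[-0:] would return the whole list
    return history[-max_turns:]


if __name__ == "__main__":
    history = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
    print(trim_history(history, 2))
```

In demo.py this would be one extra line after the streaming loop finishes, such as `history = trim_history(history, 8)`. The trade-off is that the model forgets anything said before the retained turns.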

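Conclusion

ChatGLM2-6B is one of the models most worth trying for individual developers today. By following this guide you now have a local AI assistant that runs offline, keeps your data private, and is fully under your control. From here, you can try connecting it to your own knowledge base (RAG) or fine-tuning it with LoRA.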


Source: https://blog.csdn.net/2403_87969572/article/details/156704235

Captured at (local ISO): 2026-05-18 05:17:29


Local ChatGLM2-6B Deploy: Bilingual Chat Assistant from Scratch

A from-zero local deployment walkthrough for THU ChatGLM2-6B: GPU/RAM expectations, environment setup, weight download, a minimal inference sanity check and streaming CLI demo, plus common pitfalls.


Introduction

ChatGLM2-6B from Tsinghua is a strong bilingual chat model. With 4-bit quantization it can run on 6 GB GPUs (e.g. RTX 3060). This guide deploys locally and builds a terminal streaming demo.

1. Hardware and environment

GPU: ≥6 GB (INT4) or ≥14 GB (FP16).
OS: Windows (WSL2) or Linux.
Tools: Anaconda or Miniconda.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Skip if your image already has PyTorch.

git clone https://github.com/THUDM/ChatGLM2-6B

cd ChatGLM2-6B
pip install -r requirements.txt

2. Download weights

The weights are about 12 GB; in China, ModelScope is the faster source:

pip install modelscope

python -c "from modelscope import snapshot_download; snapshot_download('ZhipuAI/chatglm2-6b', cache_dir='./')"

Download may take ~20 minutes.

3. Smoke test

from transformers import AutoTokenizer, AutoModel
import torch

model_path = "ZhipuAI/chatglm2-6b" 

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# FP16 if VRAM > 13G
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).cuda()

# INT4 for 6–8G
# model = AutoModel.from_pretrained(model_path, trust_remote_code=True).quantize(4).cuda()

# CPU
# model = AutoModel.from_pretrained(model_path, trust_remote_code=True).float()

model = model.eval()

response, history = model.chat(tokenizer, "你好，请介绍一下你自己", history=[])
print(response)

Save as test.py, run python test.py.

4. Terminal streaming demo

Create demo.py with the full script shown earlier (UTF-8 fixes on Windows, INT4 load, stream_chat loop). Key points:

model.quantize(4).cuda() for 6–8 GB VRAM.
Commands: clear resets history, exit quits.
On Windows run chcp 65001 if garbled.

5. Run chat

python demo.py

6. Pitfalls

OOM: Must use .quantize(4) on 6–8 GB; CPU .float() is very slow.
Slow multi-turn: history grows with every turn, so type clear periodically.
Tokenizer cache migration: Wait on first-run cache upgrade warnings.

Closing

You now have a private, offline ChatGLM2 assistant. Next steps: RAG or LoRA fine-tuning.
