# Chinese-Word-Vectors

**Repository Path**: spring-23/Chinese-Word-Vectors

## Basic Information

- **Project Name**: Chinese-Word-Vectors
- **Description**: 100+ Chinese Word Vectors 上百种预训练中文词向量
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2020-06-25
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Chinese Word Vectors 中文词向量

This project provides 100+ Chinese word vectors (embeddings) trained with different **representations** (dense and sparse), **context features** (word, ngram, character, and more), and **corpora**. Pre-trained vectors with different properties can be downloaded and used directly in downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset, **CA8**, and an evaluation toolkit for users to evaluate the quality of their word vectors.

## Reference

Please cite the following paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

```
@InProceedings{P18-2023,
  author =    "Li, Shen and Zhao, Zhe and Hu, Renfen and Li, Wensi and Liu, Tao and Du, Xiaoyong",
  title =     "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =      "2018",
  publisher = "Association for Computational Linguistics",
  pages =     "138--143",
  location =  "Melbourne, Australia",
  url =       "http://aclweb.org/anthology/P18-2023"
}
```

A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is given in the following paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)

```
@incollection{qiu2018revisiting,
  title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={209--221},
  year={2018},
  publisher={Springer}
}
```

## Format

The pre-trained vector files are in text format. Each line contains a word followed by its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the dimension size.

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They are in the same format as liblinear, where the number before the `:` denotes the dimension index and the number after the `:` denotes the value.

## Pre-trained Chinese Word Vectors

### Basic Settings

| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
|---|---|---|---|---|---|
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

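
As a rough illustration of what these settings usually mean in word2vec-style training, the sketch below records them in a dictionary and shows two of them in action. The key names, the helper functions, and the reading of "Low-Frequency Word" as a minimum count are assumptions made for this example. The sub-sampling rule is the one from the original word2vec paper, which may differ from what the authors' training toolkit actually does.

```python
import math
import random

# Values taken from the Basic Settings table; the key names are hypothetical.
SETTINGS = {
    "window_size": 5,
    "dynamic_window": True,
    "sub_sampling": 1e-5,
    "low_frequency_threshold": 10,  # assumed: words rarer than this are dropped
    "iterations": 5,
    "negative_samples": 5,
}


def effective_window(rng: random.Random, settings=SETTINGS) -> int:
    """With a dynamic window, the window size is drawn uniformly from 1..window_size."""
    if settings["dynamic_window"]:
        return rng.randint(1, settings["window_size"])
    return settings["window_size"]


def discard_probability(relative_frequency: float, settings=SETTINGS) -> float:
    """Sub-sampling rule from the word2vec paper: 1 - sqrt(t / f), clipped at 0."""
    t = settings["sub_sampling"]
    return max(0.0, 1.0 - math.sqrt(t / relative_frequency))


rng = random.Random(0)
windows = [effective_window(rng) for _ in range(1000)]
p_frequent = discard_probability(1e-3)  # a very frequent word is mostly discarded
p_rare = discard_probability(1e-6)      # a word rarer than the threshold is always kept
print(min(windows), max(windows), round(p_frequent, 3), p_rare)
```

Under these assumptions, a word that makes up 0.1% of the corpus is dropped about 90% of the time, while words with a relative frequency below 1e-5 are never dropped.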

**Word2vec / Skip-Gram with Negative Sampling (SGNS)**

Columns list the context features.

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d | 300d | 300d |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (Baidu Netdisk / Google Drive) | 300d / 300d | 300d / 300d | 300d / 300d | 300d / 300d |

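
The dense files listed above follow the layout described in the Format section: a header line with the word count and dimension, then one word per line followed by its values. A minimal sketch of a reader for that layout is shown here. The function name `read_dense_vectors` and the toy three-dimensional sample are invented for illustration, and real files have 300 dimensions.

```python
from typing import Dict, Iterable, List, Tuple


def read_dense_vectors(lines: Iterable[str]) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse a header line ("<count> <dim>") followed by "<word> v1 v2 ... v_dim" lines."""
    it = iter(lines)
    header = next(it).split()
    declared_count, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in it:
        # Split from the right so that only the last `dim` fields are treated as values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return declared_count, dim, vectors


# Toy stand-in for a downloaded file (3 dimensions instead of 300).
sample = ["2 3", "中国 0.1 0.2 0.3", "北京 0.4 0.5 0.6 "]
count, dim, vecs = read_dense_vectors(sample)
print(count, dim, vecs["北京"])
```

For a real file, the same function can be given an open file handle, for example `with open(path, encoding="utf-8") as f: read_dense_vectors(f)`, where `path` points to an uncompressed download. The layout resembles the word2vec text format, so gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` may also work, although that has not been verified here.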

**Positive Pointwise Mutual Information (PPMI)**

Columns list the context features.

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

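
The sparse files use liblinear-style `index:value` pairs, as noted in the Format section. The sketch below parses such lines into dictionaries. It assumes that each line begins with the word itself and that the file starts with a header line to skip, neither of which is spelled out in the description above. The function name and the sample data are hypothetical.

```python
from typing import Dict, Iterable


def read_sparse_vectors(lines: Iterable[str], has_header: bool = True) -> Dict[str, Dict[int, float]]:
    """Parse "<word> i:v i:v ..." lines into {word: {dimension index: value}}."""
    vectors: Dict[str, Dict[int, float]] = {}
    for line_number, line in enumerate(lines):
        if has_header and line_number == 0:
            continue  # assumed meta line, e.g. "<count> <dim>"
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        word, pairs = tokens[0], tokens[1:]
        row: Dict[int, float] = {}
        for pair in pairs:
            index, _, value = pair.partition(":")
            row[int(index)] = float(value)
        vectors[word] = row
    return vectors


# Toy sample: two words in a 5-dimensional space, only non-zero entries stored.
sample = ["2 5", "中国 0:1.5 3:0.25", "北京 1:2.0 4:0.5"]
sparse = read_sparse_vectors(sample)
print(sparse["中国"])
```

Storing each row as a dictionary keeps only the non-zero entries, which suits PPMI vectors whose dimensionality is typically very large.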

| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
|---|---|---|---|
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| Ngram | Word → Ngram (1-3) | 300d | 300d |
| Ngram | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| Character | Word → Character (1-2) | 300d | 300d |
| Character | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| Position | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| Syntactic Feature | Word → Dependency | 300d | 300d |


| Corpus | Size | Tokens | Vocabulary Size | Description |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017) http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | We build the large corpus by merging the above corpora. |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China. |
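
The introduction mentions the CA8 analogical reasoning dataset. To illustrate what an analogy question asks of a set of word vectors, the sketch below uses the common vector-offset approach: for "a is to b as c is to ?", pick the word whose vector is closest in cosine similarity to b - a + c. The toy vectors and the function names are invented for this example, and the project's own evaluation toolkit may score analogies differently.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors: Dict[str, List[float]], a: str, b: str, c: str) -> str:
    """Return the word closest to b - a + c, excluding the three question words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made 3-dimensional vectors; real vectors from this project have 300 dimensions.
toy = {
    "中国": [1.0, 0.0, 1.0],
    "北京": [1.0, 1.0, 1.0],
    "日本": [0.0, 0.0, 1.0],
    "东京": [0.0, 1.0, 1.0],
    "苹果": [1.0, 0.0, 0.0],
}
answer = solve_analogy(toy, "中国", "北京", "日本")
print(answer)
```

With these toy vectors the offset 北京 - 中国 + 日本 lands exactly on 东京, so the capital-of relation is recovered. With real embeddings the match is approximate, and accuracy over many such questions is what an analogy benchmark measures.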