with open("data.txt", "r", encoding="utf-8") as fp:
    data = fp.read().strip().split("\n")

sentences = []
for d in data:
    d = d.strip()
    # Drop chapter separators, empty lines, and the source banner
    if "===" in d or len(d) == 0 or d == "《斗破苍穹》来自:":
        continue
    sentences.append(d)

with open("corpus.txt", "w", encoding="utf-8") as fp:
    fp.write("\n".join(sentences))
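The cleaning loop above can be wrapped in a small helper so the filter is reusable and easy to test (a sketch; the `===` separator check and the banner string are taken directly from the code above, the function name is my own):

```python
def clean_lines(lines):
    """Drop chapter separators, blank lines, and the source banner."""
    kept = []
    for line in lines:
        line = line.strip()
        if "===" in line or len(line) == 0 or line == "《斗破苍穹》来自:":
            continue
        kept.append(line)
    return kept

raw = ["=== 第一章 ===", "", "《斗破苍穹》来自:", "斗之力，三段！"]
print(clean_lines(raw))  # → ['斗之力，三段！']
```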
!pip install sentencepiece
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: sentencepiece in e:\miniconda3\lib\site-packages (0.1.99)
DEPRECATION: Loading egg at e:\miniconda3\lib\site-packages\whisper_live-0.0.11-py3.11.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Chinese-LLaMA tokenizer has been saved to merged_tokenizer_hf_test