stylistic update based on Codacy

ymcui · ymcui · commit 62f988d4d32e · 2023-06-25T20:56:20.000+08:00
diff --git a/notebooks/README.md b/notebooks/README.md
@@ -1,14 +1,14 @@
 # 笔记本示例 Notebooks
 
-###  ceval_example_for_chinese_alpaca.ipynb
+### ceval_example_for_chinese_alpaca.ipynb
 
 利用Chinese Alpaca模型解码C-Eval数据集的示例。
 
 Example of decoding C-Eval dataset with Chinese Alpaca.
 
 建议查看Colab上的最新版 / Check latest notebook：<a href="https://colab.research.google.com/drive/12YewimRT7JuqJGOejxN7YG8jq2de4DnF?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 
-###  convert_and_quantize_chinese_llama_and_alpaca.ipynb
+### convert_and_quantize_chinese_llama_and_alpaca.ipynb
 
 Colab上的转换和量化中文LLaMA/Alpaca（含Plus版本）的运行示例（仅供流程参考）。
 
@@ -40,7 +40,7 @@ Example of running the Gradio demo on Colab.
 
 在Colab中打开 / Open the notebook in Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ymcui/Chinese-LLaMA-Alpaca/blob/main/notebooks/gradio_web_demo.ipynb) 
 
-###  legacy/
+### legacy/
 
 旧版notebook，供参考，但不会再更新。
 
diff --git a/scripts/README.md b/scripts/README.md
@@ -1,6 +1,6 @@
 # 代码与脚本 Code and Scripts
 
-###  training/
+### training/
 
 预训练与指令精调代码，Wiki：
 
@@ -12,13 +12,13 @@ Pre-training and instruction finetuning code, Wiki:
 - Pre-training: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Pretraining-Script
 - Instruction finetuning: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/SFT-Script
 
-###  inference/
+### inference/
 
 使用🤗transformers进行推理，Wiki：[https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/使用Transformers推理](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/使用Transformers推理)
 
 Inference using 🤗transformers, Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Inference-with-Transformers
 
-###  langchain/
+### langchain/
 
 使用LangChain进行检索式问答和文本摘要的示例，Wiki：[https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/与LangChain进行集成](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/与LangChain进行集成)
 
@@ -30,25 +30,25 @@ Using LangChain for Retrieval QA and Summarization, Wiki: https://github.com/ymc
 
 A server that implements OPENAI API using fastapi, Wiki: [https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/API-Calls](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/API-Calls)
 
-###  merge_tokenizer/
+### merge_tokenizer/
 
 中文词表扩充代码，Wiki: [https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/训练细节#准备工作词表扩充](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/训练细节#准备工作词表扩充)
 
 Code for extending Chinese vocabulary, Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Training-Details#preparation-vocabulary-expansion
 
-###  merge_llama_with_chinese_lora.py
+### merge_llama_with_chinese_lora.py
 
 合并LLaMA/Alpaca LoRA脚本，Wiki: [https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/手动模型合并与转换](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/手动模型合并与转换)
 
 Script for merging LLaMA/Alpaca LoRA. Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Manual-Conversion
 
-###  merge_llama_with_chinese_lora_low_mem.py
+### merge_llama_with_chinese_lora_low_mem.py
 
 （推荐）低资源版合并LLaMA/Alpaca LoRA脚本，Wiki: [https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/手动模型合并与转换](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/手动模型合并与转换)
 
 （recommended）Script for merging LLaMA/Alpaca LoRA (low-resource version). Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Manual-Conversion
 
-###  crawl_prompt.py
+### crawl_prompt.py
 
 指令数据爬取脚本，Wiki：[https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/训练细节#训练数据](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/训练细节#训练数据)
 
diff --git a/scripts/ceval/evaluator.py b/scripts/ceval/evaluator.py
@@ -26,7 +26,7 @@ def generate_few_shot_prompt(self, subject, dev_df):
         for i in range(k):
             prompt += self.format_example(dev_df.iloc[i, :])
         return prompt
-    
+
     def eval_subject(self, subject_name, test_df, dev_df=None, few_shot=False, save_result_dir=None):
         pass
 
diff --git a/scripts/langchain/langchain_sum.py b/scripts/langchain/langchain_sum.py
@@ -15,10 +15,9 @@
 from langchain import HuggingFacePipeline
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.prompts import PromptTemplate
-from langchain.docstore.document import Document
 from langchain.chains.summarize import load_summarize_chain
 
-prompt_template = """Below is an instruction that describes a task. 
+prompt_template = """Below is an instruction that describes a task.
                     Write a response that appropriately completes the request.\n\n
                     ### Instruction:\n请为以下文字写一段摘要:\n{text}\n\n### Response: """
 refine_template = (
@@ -41,7 +40,7 @@
         device = torch.device(0)
     else:
         device = torch.device('cpu')
-    
+
     text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=100, length_function=len)
     with open(file_path) as f:
         text = f.read()
diff --git a/scripts/merge_llama_with_chinese_lora_low_mem.py b/scripts/merge_llama_with_chinese_lora_low_mem.py
@@ -210,7 +210,7 @@ def merge_shards(output_dir, num_shards: int):
         shards_merged = {}
         for d in shards_dicts:
             shards_merged |= d
-    
+
         print(f"Saving the merged shard to " + os.path.join(output_dir, f"consolidated.0{i}.pth"))
         torch.save(shards_merged, os.path.join(output_dir, f"consolidated.0{i}.pth"))
 
@@ -305,7 +305,7 @@ def merge_shards(output_dir, num_shards: int):
                         print(f"merging {lora_key_A} and lora_B.weight form {tl_idx}-th LoRA weight to {k}")
                     state_dict[k] += (
                         transpose(
-                            t_and_l['state_dict'][lora_key_B].float() 
+                            t_and_l['state_dict'][lora_key_B].float()
                           @ t_and_l['state_dict'][lora_key_A].float(), t_and_l['fan_in_fan_out']) * t_and_l['scaling']
                     )
             weight_size = state_dict[k].numel() * dtype_byte_size(state_dict[k].dtype)
diff --git a/scripts/merge_tokenizer/merge_tokenizers.py b/scripts/merge_tokenizer/merge_tokenizers.py
@@ -62,6 +62,5 @@
 text='''白日依山尽，黄河入海流。欲穷千里目，更上一层楼。
 The primary use of LLaMA is research on large language models, including'''
 print("Test text:\n",text)
-print
 print(f"Tokenized by LLaMA tokenizer:{llama_tokenizer.tokenize(text)}")
 print(f"Tokenized by Chinese-LLaMA tokenizer:{chinese_llama_tokenizer.tokenize(text)}")
diff --git a/scripts/openai_server_demo/README.md b/scripts/openai_server_demo/README.md
@@ -116,7 +116,7 @@ json返回体：
 
 `top_k`: During random sampling, the top_k highest-probability tokens are kept as candidates to sample from.
 
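-`top_p`: During random sampling, the highest-probability tokens are kept as candidates until their cumulative probability exceeds top_p; the lower the value, the less random the output. For example, with top_p set to 0.6 and the five most likely tokens having probabilities [0.23, 0.20, 0.18, 0.11, 0.10], the first three tokens have a cumulative probability of 0.61, so the fourth token onward is filtered out and only the first three tokens remain as candidates.
+`top_p`: During random sampling, the highest-probability tokens are kept as candidates until their cumulative probability exceeds top_p; the lower the value, the less random the output. For example, with top_p set to 0.6 and the five most likely tokens having probabilities {0.23, 0.20, 0.18, 0.11, 0.10}, the first three tokens have a cumulative probability of 0.61, so the fourth token onward is filtered out and only the first three tokens remain as candidates.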
 
 `repetition_penalty`: Repetition penalty; for details, see this paper: <https://arxiv.org/pdf/1909.05858.pdf>.
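
The `top_p` example above can be checked with a small sketch. This is a minimal illustration, not the server's actual sampling code: `top_p_candidates` is a hypothetical helper that sorts the probabilities, accumulates them from the most likely token down, and stops once the running total reaches `top_p`. Under this reading, a smaller `top_p` keeps fewer candidates, so sampling is less random.

```python
def top_p_candidates(probs, top_p):
    """Return indices of the smallest set of most likely tokens
    whose cumulative probability reaches top_p.

    Hypothetical helper for illustration only; library implementations
    may handle boundary cases differently.
    """
    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break  # enough probability mass collected; drop the rest
    return kept


# Probabilities of the five most likely tokens from the README example.
probs = [0.23, 0.20, 0.18, 0.11, 0.10]
print(top_p_candidates(probs, top_p=0.6))  # [0, 1, 2]
```

With `top_p=0.6` the running totals are 0.23, 0.43, and 0.61, so the first three tokens are kept and the fourth is filtered out, matching the description.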
 
diff --git a/scripts/openai_server_demo/openai_api_server.py b/scripts/openai_server_demo/openai_api_server.py
@@ -182,7 +182,7 @@ async def create_chat_completion(request: ChatCompletionRequest):
     else:
         msgs = [ChatMessage(role=x['role'],content=x['message']) for x in msgs]
     output = predict(
-        input=msgs, 
+        input=msgs,
         max_new_tokens=request.max_tokens,
         top_p=request.top_p,
         top_k=request.top_k,
@@ -200,7 +200,7 @@ async def create_chat_completion(request: ChatCompletionRequest):
 async def create_completion(request: CompletionRequest):
     """Creates a completion"""
     output = predict(
-        input=request.prompt, 
+        input=request.prompt,
         max_new_tokens=request.max_tokens,
         top_p=request.top_p,
         top_k=request.top_k,
diff --git a/scripts/training/run_clm_sft_with_peft.py b/scripts/training/run_clm_sft_with_peft.py
@@ -322,8 +322,8 @@ def main():
             files = [os.path.join(path,file.name) for file in path.glob("*.json")]
             logger.info(f"training files: {' '.join(files)}")
             train_dataset = buid_instruction_dataset(
-                data_path=files, 
-                tokenizer=tokenizer, 
+                data_path=files,
+                tokenizer=tokenizer,
                 max_seq_length=data_args.max_seq_length,
                 data_cache_dir = None, 
                 preprocessing_num_workers = data_args.preprocessing_num_workers)