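Data extraction
The input consists of the reading material and the associated requirements. We use regular expressions to match the text and pandas to process the xlsx file.
We store the extracted data in JSON format, where each record has two fields, input and output. input is made up of the prompt plus the reading text; output is made up of the options plus the answers.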
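As a rough illustration of the record layout described above, a single record might look like the sketch below. The strings are made-up placeholders, and `make_record` is a hypothetical helper introduced only for this example; it is not a function from the original post.

```python
import json


def make_record(prompt, reading_text, options, answers):
    """Build one record: input = prompt + reading text, output = options + answers."""
    return {
        "input": prompt + reading_text,
        "output": options + answers,
    }


# Placeholder strings, for illustration only.
record = make_record(
    prompt="<prompt>\n",
    reading_text="<reading text>",
    options="1.<question>\nA<choice>\nB<choice>\nC<choice>\nD<choice>\n",
    answers="答案:A\n",
)

# ensure_ascii=False keeps Chinese characters readable in the JSON text.
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Each record is a flat pair of strings, so a list of such dictionaries can be serialized directly.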
Chinese-language questions
Extracting the data
Load the data from the Excel file.
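    import re

    import pandas as pd

    # '训练集-语文.xlsx' is the Chinese training set.
    df = pd.read_excel('训练集-语文.xlsx')
    # Normalize full-width punctuation to ASCII so that the patterns below,
    # which look for '.' and '(', can match.
    df = df.replace('.', '.', regex=True)
    df = df.replace('(', '(', regex=True)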
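This is the function that handles the Chinese questions. Here questions_with_answers is the question section of each reading passage, which contains both multiple-choice and non-multiple-choice questions.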
    def chinese_multiple_choice_questions(questions_with_answers):
        text = questions_with_answers

        # One question runs from "<number>." up to the next "<number>." or the end.
        question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
        # One choice is a letter A-D plus the text up to the next letter or line end.
        choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)
        # Splits a question stem into its number and its text.
        pattern_question = re.compile(r'(\d+)\.(.*)')

        questions = question_pattern.findall(text)

        multiple_choice_questions = []
        short_answer_questions = []

        for idx, question in enumerate(questions):
            # A question containing any of the letters A-D is treated as multiple choice.
            if re.search(r'[A-D]', question):
                choices = choice_pattern.findall(question)
                # Stem: the first line of the text before the first '('.
                question_text = re.split(r'\n', question.split('(')[0])[0]
                # Renumber the question by its position in the list.
                matches_question = (
                    str(idx + 1) + '.' + pattern_question.findall(question_text)[0][1]
                )
                multiple_choice_questions.append({
                    'question': matches_question,
                    'choices': choices
                })
            else:
                short_answer_questions.append(question.strip())

        # Only the multiple-choice questions are returned.
        return multiple_choice_questions
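To see what the two patterns in the function above capture, the sketch below runs them on a made-up two-question string. The sample text is an assumption chosen for illustration; it is deliberately lower-case, because the choice pattern treats any capital A-D as the start of a choice.

```python
import re

# The same two patterns used in chinese_multiple_choice_questions.
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

# Made-up sample: one multiple-choice question and one short-answer question.
sample = (
    "1.which statement is correct(3 points)\n"
    "A.first\nB.second\nC.third\nD.fourth\n"
    "2.summarise the author's view"
)

questions = question_pattern.findall(sample)
print(len(questions))  # 2

# Each choice comes back as a (letter, text) tuple; the text keeps the dot.
choices = choice_pattern.findall(questions[0])
print(choices)

# The second question has no A-D letters, so it counts as short answer.
print(re.search(r'[A-D]', questions[1]) is None)  # True
```

Because the split relies on capital letters and on digits followed by a dot, option text that itself contains such characters would be cut in unexpected places, so real data is worth spot-checking.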
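Extracting the answers
We again use regular expressions to match the answer to each question. Only the multiple-choice answers need to be kept, and they are renumbered by id.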
    def chinese_multiple_choice_answers(questions_with_answers):
        # Drop spaces and newlines so every answer looks like "<number>.<answer>".
        questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")

        # Multiple-choice answers start with capital letters; short answers do not.
        choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
        short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')

        choice_matches = choice_pattern.findall(questions_with_answers)
        short_matches = short_pattern.findall(questions_with_answers)

        # Map question number -> answer.
        choice_answers = {int(index): answer for index, answer in choice_matches}
        short_answers = {int(index): answer for index, answer in short_matches}

        # Sort by question number.
        sorted_choice_answers = sorted(choice_answers.items())
        sorted_short_answers = sorted(short_answers.items())  # not used further

        # Keep only the multiple-choice answers, renumbered from 1.
        answers = []
        for idx in range(len(sorted_choice_answers)):
            answers.append(f"{idx + 1}. {sorted_choice_answers[idx][1]}")
        return answers
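The renumbering step can be illustrated with a condensed sketch. `renumber_choice_answers` is a hypothetical name for a shortened version of the logic above, and the sample answer string is made up; note that a short answer beginning with a capital letter would be mistaken for a choice answer under this pattern.

```python
import re


def renumber_choice_answers(answer_text):
    """Keep answers that start with capital letters and renumber them from 1."""
    compact = answer_text.replace(" ", "").replace("\n", "")
    matches = re.findall(r'(\d+)\.([A-Z]+)', compact)
    # Map question number -> letters, then sort by question number.
    by_number = sorted({int(num): letters for num, letters in matches}.items())
    return [f"{i + 1}. {letters}" for i, (_, letters) in enumerate(by_number)]


# Made-up sample: question 3 is a short answer and is dropped,
# so the original question 4 becomes number 3.
sample = "1.B 2.C\n3.see the passage 4.D"
print(renumber_choice_answers(sample))  # ['1. B', '2. C', '3. D']
```

Renumbering keeps the answers aligned with the questions, which are also renumbered by position once the short-answer questions are filtered out.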
Prompt generation
The prompt consists of the reading text plus the requirements.
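    def get_prompt_cn(text):
        # The prompt stays in Chinese. It asks the model to act as a gaokao
        # multiple-choice question writer and produce 4 single-answer questions
        # (1 question, 4 options, 1 correct answer each) from the reading text,
        # covering: key concepts, key sentences, and the thesis, evidence and
        # argumentation methods.
        prompt = f'''
        你是一个高考选择题出题专家,你出的题有一定深度,你将根据阅读文本,出4道单项选择题,包含题目选项,以及对应的答案,注意:不用给出原文,每道题由1个问题和4个选项组成,仅存在1个正确答案,请严格按照要求执行。
        阅读文本主要是中文,你出的题目需要满足以下要点,紧扣文章内容且题干和答案为中文:

        ###回答要求
        (1)理解文中重要概念的含义
        (2)理解文中重要句子的含意
        (3)分析论点、论据和论证方法

        ###阅读文本
        {text}
        '''
        return prompt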
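Processing all the data
With all the functions in place, the next step is to split and recombine the data into the input and output parts in the main function.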
    def process_cn(df):
        res_input = []
        res_output = []
        for row in range(len(df)):
            # Columns: '选项' = options, '答案' = answers, '阅读文本' = reading text.
            data_options = df.loc[row, '选项']
            data_answers = df.loc[row, '答案']
            data_prompt = df.loc[row, '阅读文本']

            data_options = chinese_multiple_choice_questions(data_options)
            data_answers = chinese_multiple_choice_answers(data_answers)
            data_prompt = get_prompt_cn(data_prompt)

            # Keep a row only if every extracted question has an answer.
            if len(data_answers) == len(data_options):
                res = ''
                for idx, question in enumerate(data_options):
                    res += f"\n{question['question']}?\n"
                    for choice in question['choices']:
                        # choice is a (letter, text) tuple.
                        res = res + choice[0] + choice[1] + '\n'
                    # '答案:' means 'Answer:'.
                    res = res + '答案:' + str(data_answers[idx].split('.')[-1]) + '\n'
                res_output.append(res)
                res_input.append(data_prompt)
        return res_input, res_output
English section
The English part is largely the same as the Chinese part, so we only give the code here.
    import re


    def remove_whitespace_and_newlines(input_string):
        # Remove spaces, newlines and dots so "32. B." becomes "32B".
        result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
        return result


    # Example of the answer text found in the spreadsheet.
    text = """
    32. B. The underlying logic of the effect.
    33.D. estimates were not fully independent.
    34.C. The discussion process.
    35.D. Approving.
    """


    def get_answers(text):
        text = remove_whitespace_and_newlines(text)
        # A digit followed by a letter A-D marks one answer.
        pattern = re.compile(r'(\d)\s*([A-D])')
        matches = pattern.findall(text)
        res = []
        for match in matches:
            number_dot, first_letter = match
            res.append(first_letter)
        return res


    def get_prompt_en(text):
        prompt = f'''
        你是一个高考选择题出题专家,你出的题有一定深度,你将根据阅读文本,出4道单项选择题,包含题目选项,以及对应的答案,注意:不用给出原文,每道题由1个问题和4个选项组成,仅存在1个正确答案,请严格按照要求执行。
        The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:

        ### 回答要求
        (1)Understand the main idea of the text.
        (2)Understand the specific information in the text.
        (3)Inferring the meaning of words and phrases from the context

        ### 阅读文本
        {text}
        '''
        return prompt


    def process_en(df):
        res_input = []
        res_output = []
        for row in range(len(df)):
            data_options = df.loc[row, '选项']
            data_answers = df.loc[row, '答案']
            data_prompt = df.loc[row, '阅读文本']

            # get_questions is not shown in this post; it is expected to return
            # a list of dicts with a 'question' string and an 'options' dict
            # keyed by 'A' to 'D'.
            data_options = get_questions(data_options)
            data_answers = get_answers(data_answers)
            data_prompt = get_prompt_en(data_prompt)

            # Keep a row only if every extracted question has an answer.
            if len(data_answers) == len(data_options):
                res = ''
                for idx, question in enumerate(data_options):
                    res += (
                        f"\n{idx + 1}.{question['question']}\n"
                        f"{question['options']['A']}\n"
                        f"{question['options']['B']}\n"
                        f"{question['options']['C']}\n"
                        f"{question['options']['D']}\n"
                        f"answer:{data_answers[idx]}\n\n"
                    )
                res_output.append(res)
                res_input.append(data_prompt)
        return res_input, res_output
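The code above calls get_questions, whose body is not included in the post. The sketch below is one possible implementation, written purely as an assumption: it infers the return shape from how process_en reads the result (a 'question' string and an 'options' dict keyed by 'A' to 'D'), and it assumes each question and each option sits on its own line. The sample text is made up.

```python
import re

# Assumed line formats: "32. question text" and "A. option text".
QUESTION_LINE = re.compile(r'^\d+\.\s*(.+)$')
OPTION_LINE = re.compile(r'^([A-D])\.\s*(.+)$')


def get_questions(text):
    """Hypothetical parser: returns [{'question': str, 'options': {'A': ..., ...}}]."""
    questions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        question = QUESTION_LINE.match(line)
        option = OPTION_LINE.match(line)
        if question:
            # Store the stem without its number; process_en renumbers it.
            current = {'question': question.group(1), 'options': {}}
            questions.append(current)
        elif option and current is not None:
            # Store the whole line so the letter label is kept in the output.
            current['options'][option.group(1)] = line
    # Keep only questions that have all four options.
    return [q for q in questions if set(q['options']) == {'A', 'B', 'C', 'D'}]


sample = """
32. What is the text mainly about?
A. The first idea.
B. The second idea.
C. The third idea.
D. The fourth idea.
33. What is the author's attitude?
A. Doubtful.
B. Neutral.
C. Critical.
D. Approving.
"""

parsed = get_questions(sample)
print(len(parsed))                # 2
print(parsed[1]['options']['D'])  # D. Approving.
```

Option text that wraps onto a second line is ignored by this sketch, so spreadsheets with wrapped options would need a more forgiving parser.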
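Merging the data
Combining the Chinese and English data gives only 102 records in total. To get past the 150-record target, we append the first 30 Chinese records and the first 20 English records a second time, for 152 records overall.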
    df_new = pd.DataFrame({
        'input': cn_input + cn_input[:30] + en_input + en_input[:20],
        'output': cn_output + cn_output[:30] + en_output + en_output[:20]
    })
Here, pd.DataFrame({…}) uses the pandas library to create a DataFrame.
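The post says the extracted data is stored as JSON but does not show that step. A minimal sketch using only the standard library follows, assuming the four lists have the same roles as cn_input, cn_output, en_input and en_output above. `build_records` and its `cn_dup` and `en_dup` parameters are hypothetical names, and the tiny lists are placeholders.

```python
import json


def build_records(cn_input, cn_output, en_input, en_output, cn_dup=30, en_dup=20):
    """Merge both languages, repeating the first cn_dup / en_dup records."""
    inputs = cn_input + cn_input[:cn_dup] + en_input + en_input[:en_dup]
    outputs = cn_output + cn_output[:cn_dup] + en_output + en_output[:en_dup]
    return [{'input': i, 'output': o} for i, o in zip(inputs, outputs)]


# Tiny placeholder lists standing in for the real extracted data.
cn_in, cn_out = ['cn prompt 1', 'cn prompt 2'], ['cn answer 1', 'cn answer 2']
en_in, en_out = ['en prompt 1'], ['en answer 1']

records = build_records(cn_in, cn_out, en_in, en_out, cn_dup=1, en_dup=1)
print(len(records))  # 5

# ensure_ascii=False keeps Chinese text readable in the serialized output.
json_text = json.dumps(records, ensure_ascii=False, indent=2)
print(json_text)
```

Writing json_text to a file opened with encoding='utf-8' would then produce the JSON dataset, one object per record with the input and output fields.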