Xunzi the LLM: A Way for People to Access Ancient Chinese Texts
大型语言模型“荀子” 让人们接触中国古籍
Thousands of years ago, texts appeared on animal bones, bronzes, bamboo slips, and silk brocades before they were written on paper. But now these ancient Chinese texts have a new container.
In December 2023, a research team from Nanjing Agricultural University rolled out Xunzi, a large language model (LLM), and XunziChat, its dialogue model, in association with Gulian, a professional publisher of ancient Chinese texts.
Wang Dongbo, the leader of the research team, said that the large language model was named after Xunzi because Xunzi was not only a prominent Confucian philosopher during the late Warring States Period (475 BC—221 BC), but also a pioneer in presenting and explaining theories of linguistics in ancient China.
When asked why he and his partners made the large language model, Wang explained that traditional Chinese characters, vertical layout, and the absence of pausing and punctuation are all obstacles that readers have to overcome when they read traditional texts.
To create Xunzi the LLM, Wang and his partners first did a lot of research. Since 2013, his team has worked tirelessly to digitize Chinese classics like the Siku Quanshu, or the Complete Library in Four Sections. “The hard work involves a large-scale corpus of two billion Chinese characters, which has laid a solid foundation for the large language model,” said Wang.
几千年前,文字先是写在兽骨、青铜器、竹简和织锦上,然后才被人们写在纸上。但如今,这些古老的中文文本已经有了新载体。
2023年12月,南京农业大学的一个研究团队,与一家专业的古籍出版公司古联联手,推出了大型语言模型荀子和荀子对话模型。
研究团队带头人王东波表示,该大型语言模型以荀子的名字命名,是因为荀子不仅是战国(公元前475年—公元前221年)晚期著名的儒学思想家,还是提出和解释中国古代语言学理论的先驱者。
当被问及他和他的同伴创建这个大型语言模型的原因时,王东波解释道:繁体字、竖版、缺少停顿和标点符号都是读者在阅读繁体文本时需要克服的障碍。
为了创建大型语言模型荀子,王东波和他的同伴们先做了大量的研究。自2013年以来,他的团队始终致力于将《四库全书》等中国经典书籍数字化。“经过辛勤努力,我们建立了20亿个汉字的大型语料库,为建立大型语言模型奠定了坚实的基础。”王东波说。
Their efforts seem to have paid off. Xunzi the LLM can now tag, translate, punctuate, and comprehend passages of ancient Chinese text. It can even perform part-of-speech analysis and retrieve specific information from a text, such as names, events, and places.
With this LLM, ancient Chinese texts become accessible to more Chinese people, including students. For instance, if users type the pinyin shangu into the chat box, the model will not only recognize the word it stands for but also point out that it is a person’s courtesy name in certain ancient Chinese texts. Through Xunzi’s retrieval function, users can then get more specific cultural information based on courtesy names.
“The model can help us mine for more information hidden in our cultural legacy and find unnoticed patterns and connections,” said Wang.
But Wang and his team aren’t simply focused on target users in China. They are aiming at the rest of the world as well. They have shared the LLM on GitHub and other websites, allowing users to download and use it for free. “Our team is committed to the philosophy of making our data and model globally accessible. We hope this will encourage more people to appreciate excellent traditional Chinese culture,” Wang explained.
他们的努力似乎得到了回报。现在,大型语言模型荀子可以对中国古代文本的片段进行标记、翻译、加标点和阅读理解。它甚至可以进行词性分析并检索特定信息,如文本中的名称、事件和地点。
通过这个大型语言模型,包括学生在内的更多中国人,可以接触到中国古籍。例如,如果用户在聊天框中输入shangu的拼音,它不仅能识别出山谷一词,还会给用户指出与这个词相关的、古籍中一个中国文人的字等。通过荀子的检索功能,用户可以根据古人的字获取更具体的文化信息。
“这个模型可以帮助我们挖掘更多隐藏在文化遗产中的信息,找到未被注意到的样本和关联。”王东波说。
然而,王东波和他的团队不仅着眼于中国的目标用户,还将目光投向了世界其他地区。他们在GitHub和其他网站上共享了荀子,允许用户免费下载和使用。“我们团队秉持着让我们的数据和模型能在全球范围内被人们使用的理念,希望以此鼓励更多人了解中国优秀传统文化。”王东波解释道。
Word Bank
theory /'θɪəri/ n. 理论;原理
pause /pɔːz/ v. 暂停;停顿
The woman spoke almost without pausing for breath.
obstacle /'ɒbstəkl/ n. 障碍;阻碍
analysis /ə'næləsɪs/ n. (对事物的)分析
appreciate /ə'priːʃieɪt/ v. 欣赏;赏识
You can’t really appreciate foreign literature in translation.