2024 Shanghai Maritime University Language, Data and Translation Academic Training Camp Series Reports (4)

Development and Application of Corpora

September 12, 2024 was the fourth day of the academic training camp. In a lecture entitled Assessing Corpus Composition, Professor Serge Sharoff led eager students into the vast ocean of computational linguistics. He first sketched the history of corpus development, from its earliest beginnings to its present flourishing, and showed how much the field has contributed to the exploration of human knowledge. By tracing this history, Professor Sharoff built a bridge between theory and practice and analyzed the key role corpora play amid the torrent of information in contemporary society.

 


The professor then took the audience through a detailed classification of text types, each of which has its own analytical value and significance in computational linguistics research. He described the differences between the various text types and analyzed how these differences affect the performance limits of natural language processing tasks, pointing out directions for future research.

 

Turning to practice, Professor Sharoff addressed the central challenge of corpus construction: the representativeness and validity of the data. He stressed that these two criteria are not only the gold standard for measuring the quality of a corpus but also the cornerstone of scientifically sound research results. Comparing the British National Corpus (BNC) with the Brown University Standard Corpus (BC), he showed the significant differences between the two in scale, coverage, and annotation accuracy, and used a series of vivid cases to make these abstract concepts concrete and easy to grasp.

 


At the high point of the lecture, Professor Sharoff brought Transformer language models from Hugging Face into the presentation, showing how these powerful pre-trained models can be used for efficient automatic classification and genre recognition of large volumes of text. This approach greatly expands the range of corpus applications and brings new vitality and momentum to computational linguistics research. He stressed that work in this field concerns not only technological innovation but also our understanding of language, culture, and human society, which gives it great practical significance and value.

 


Throughout the lecture, Professor Sharoff's profound knowledge, rigorous attitude, and passionate speaking style aroused the students' keen interest in corpus research and stirred their imagination. The lecture was not only a transfer of knowledge but also an inspiration, encouraging aspiring young people to devote themselves to the great journey of exploring the mysteries of language.


(Reported by the College of Foreign Languages Office; translated by Shi Ying)