CLiC-it – 10th Italian Conference on Computational Linguistics

Bernardo Magnini and Giovanni Bonetta of FBK will present the tutorial ‘You Are what You Eat: Processing Data for Training and Evaluating LLMs’ during the 10th edition of the ‘Italian Conference on Computational Linguistics’.

Auditorium Area della Ricerca – CNR

Via Giuseppe Moruzzi, Pisa

CLiC-it aims to bring together researchers from different fields including Computational Linguistics and Natural Language Processing, Linguistics, Cognitive Science, Machine Learning, Computer Science, Knowledge Representation, Information Retrieval, and Digital Humanities.

4 December 2024

TUTORIAL 9:30-12:30

You Are what You Eat: Processing Data for Training and Evaluating LLMs
Data is becoming one of the most critical ingredients for developing Large Language Models (LLMs), as the models’ behavior is largely affected by both the amount and the quality of training data. In addition, high-quality data drives LLM evaluation, with relevant implications not only for research but also for the application market. Finally, when we look at data across different languages, we cannot ignore that its availability is heavily skewed toward very few languages. The tutorial addresses key aspects of using textual data for LLMs and, with less emphasis, multimodal data. Specifically, we describe a pipeline for data preparation, covering, among other steps, collection, cleaning, deduplication, and filtering. We highlight some of the most used data repositories for training and fine-tuning LLMs, including multilingual data. We then survey the legal issues related to using data, including potential violations of regulations on copyright and on the privacy of personal information, as well as the potential generation of offensive content and misinformation. When it comes to processing data for benchmarking LLMs, the tutorial presents recent work in this area, with particular emphasis on benchmarks for Italian, including those translated from English and those originally created in Italian.
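As a rough illustration of the cleaning, deduplication, and filtering steps named in the abstract, here is a minimal sketch in Python. It is not the presenters' pipeline: the function names, the exact-match deduplication via hashing, and the thresholds (`min_words`, `min_alpha_ratio`) are all assumptions chosen for the example, and real pipelines typically add near-duplicate detection and language identification.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text: str) -> str:
    """Hash of the lowercased text, used for exact-duplicate detection."""
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()


def keep(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Heuristic quality filter (hypothetical thresholds)."""
    if len(text.split()) < min_words:
        return False
    # Share of characters that are letters or spaces.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def prepare(documents):
    """Clean, deduplicate, and filter an iterable of raw strings."""
    seen = set()
    result = []
    for raw in documents:
        text = clean(raw)
        digest = fingerprint(text)
        if digest in seen:
            continue  # exact duplicate (ignoring case and spacing)
        seen.add(digest)
        if keep(text):
            result.append(text)
    return result
```

Calling `prepare` on a list of raw strings returns the cleaned documents that survive both the duplicate check and the quality filter, in their original order.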

 


Privacy Notice
Pursuant to Art. 13 of EU Regulation No. 2016/679 (General Data Protection Regulation), and as detailed in the Privacy Policy for FBK event participants, we inform you that the event will be recorded and published on the FBK institutional channels. If you do not wish to be filmed or recorded, you can disable your webcam and/or mute your microphone during virtual events, or inform the FBK staff organizing the public event beforehand.