CLIC-IT – 10th Italian Conference on Computational Linguistics
CLiC-it aims to bring together researchers from a range of fields, including Computational Linguistics and Natural Language Processing, Linguistics, Cognitive Science, Machine Learning, Computer Science, Knowledge Representation, Information Retrieval, and Digital Humanities.
4 December 2024
TUTORIAL 9:30-12:30
You Are what You Eat: Processing Data for Training and Evaluating LLMs
Data is becoming one of the most critical ingredients in developing Large Language Models (LLMs), since a model’s behavior is largely determined by both the amount and the quality of its training data. High-quality data also drives LLM evaluation, with implications not only for research but also for the application market. Moreover, the availability of data across languages is heavily skewed toward a very small number of them. The tutorial addresses key aspects of using textual data for LLMs and, with less emphasis, multimodal data. Specifically, we describe a pipeline for data preparation that covers, among other steps, collection, cleaning, deduplication, and filtering. We highlight some of the most widely used data repositories for training and fine-tuning LLMs, including multilingual data. We then survey the legal issues raised by data use, including potential violations of regulations on copyright, on the privacy of personal information, and on the generation of offensive content and misinformation. On processing data for benchmarking LLMs, the tutorial presents recent work in this area, with particular emphasis on benchmarks for Italian, both those translated from English and those originally created in Italian.
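As a rough illustration of the cleaning, deduplication, and filtering steps named in the abstract, the following minimal Python sketch chains them over a list of already collected documents. It is not the tutorial's pipeline: the function names (`clean`, `deduplicate`, `quality_filter`, `prepare`), the exact-match deduplication strategy, and the `min_words` threshold are all assumptions chosen for brevity, and the collection step is left out.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator


def clean(text: str) -> str:
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Drop exact duplicates, compared case-insensitively via a SHA-256 hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_filter(docs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Keep documents with at least `min_words` whitespace-separated tokens.

    The threshold is an illustrative assumption, not a recommended value.
    """
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc


def prepare(raw_docs: Iterable[str], min_words: int = 5) -> list:
    """Chain the steps: clean, then deduplicate, then filter."""
    cleaned = (clean(d) for d in raw_docs)
    return list(quality_filter(deduplicate(cleaned), min_words))
```

Calling `prepare` on a list of raw strings returns the surviving documents in their original order. Pipelines used in practice typically replace the exact-match step with near-duplicate detection and the word-count rule with richer quality heuristics.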