Text analysis, deep learning and statistics

Teacher                Face-to-face hours   Working hours   ECTS
Serena Villata (I3S)   30                   60              6

Description

This course provides tools and methodology for extracting information from textual data, using two complementary approaches:

  1. Statistical analysis, based on long-established methods and baseline calculations.
  2. Deep learning, which offers methods for classifying texts and identifying linguistic markers and patterns.

The statistics part will focus mainly on the frequency analysis of words and their distribution in a corpus, using methods such as the z-score, correspondence analysis, and co-occurrence counts. The deep learning part will focus on two standard architectures for text classification: recurrent networks and convolutional networks. The hidden layers of these networks (embedding, attention, TDS) will be studied in order to extract the linguistic information the models have learned and to compare it with what the statistical methods reveal.
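As an illustration of the frequency analysis mentioned above, the sketch below computes a z-score for a word in one part of a corpus. It assumes one common formulation (a normal approximation to the binomial, comparing the observed count in the part with the count expected from the whole corpus); the course may use a different variant. The function name `z_score` and the toy sentences are hypothetical.

```python
import math


def z_score(word, part_tokens, corpus_tokens):
    """z-score of `word` in a sub-corpus, relative to the whole corpus.

    Assumption: normal approximation to the binomial distribution.
    """
    corpus_size = len(corpus_tokens)
    part_size = len(part_tokens)
    # Relative frequency of the word in the whole corpus
    p = corpus_tokens.count(word) / corpus_size
    observed = part_tokens.count(word)
    expected = part_size * p
    std = math.sqrt(part_size * p * (1 - p))
    if std == 0:
        # Word absent from the corpus, or the corpus contains only this word
        return 0.0
    return (observed - expected) / std


# Hypothetical toy corpus split into two parts
part_a = "the king rules the land and the king speaks".split()
part_b = "the people work the land and the people sing".split()
corpus_tokens = part_a + part_b

for w in ("king", "people", "the"):
    print(w, round(z_score(w, part_a, corpus_tokens), 2))
```

A positive score suggests the word is over-represented in the part, a negative score that it is under-represented, and a score near zero that it is distributed as expected.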

Learning outcomes

At the end of the course, students will be able to:

  • Create a corpus and define metadata related to a given working hypothesis
  • Choose and apply statistical methods appropriate to a given analytical need
  • Program a deep learning network for text classification
  • Extract linguistic information from the hidden layers of a deep learning network
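The first two outcomes can be sketched together: a small corpus whose documents carry metadata, and a co-occurrence count grouped by one metadata field. This is a minimal illustration under assumptions, not the course's own tooling; the documents, the metadata fields (`author`, `year`), the window size, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: each document has a text and some metadata
corpus = [
    {"author": "A", "year": 1950, "text": "the king rules the land"},
    {"author": "B", "year": 1990, "text": "the people work the land"},
    {"author": "A", "year": 1955, "text": "the king speaks to the people"},
]


def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct words at most `window` tokens apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


def cooccurrences_by(documents, key, window=2):
    """Aggregate co-occurrence counts per value of the metadata field `key`."""
    result = {}
    for doc in documents:
        tokens = doc["text"].split()
        result.setdefault(doc[key], Counter()).update(cooccurrences(tokens, window))
    return result


by_author = cooccurrences_by(corpus, "author")
for author, counts in by_author.items():
    print(author, counts.most_common(3))
```

Grouping by a metadata field is what lets a working hypothesis (for example, that two authors use a word in different contexts) be checked against the counts.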

Requirements

A good working knowledge of Python programming, plus basic statistics and machine learning.