Text analysis, deep learning and statistics

Description

Objectives:

This course provides tools and methodology for extracting information from textual data using two complementary approaches:
  1. Statistical analysis, based on long-established methods and baseline calculations.
  2. Deep learning, which offers methods for classifying texts and identifying linguistic markers and patterns.


Content:

The statistics part will focus mainly on the frequency analysis of words and their distributions in a corpus, using methods such as the z-score, correspondence analysis and the calculation of co-occurrences. The deep learning part will focus on two standard text classification architectures: recurrent networks and convolutional networks. The hidden layers of these networks (embedding, attention, TDS, i.e. Text Deconvolution Saliency) will be studied in order to extract the linguistic information learned by the models and compare it with the information obtained by statistical methods.


Prerequisites:

A good working knowledge of Python programming, plus basic statistics and machine learning.
 

Skills to be acquired or developed:

At the end of the course, students should be able to:

  • Create a corpus and define metadata related to a given working hypothesis
  • Choose statistical methods appropriate to a given analytical need
  • Program a deep learning network for text classification
  • Extract linguistic information from the hidden layers of a deep learning network

Course outline

Lesson 1: Textual data analysis
Introductory course on statistical analysis of textual data.
Tutorials: Use of the Hyperbase Web platform with illustrative examples and exercises.

Lesson 2: Preprocessing the Text
Overview of the different data formats and metadata encoding methods. Introduction to data labeling, tokenization, and standard text preprocessing methods.
Tutorials: Accessing text from the web, preprocessing text, creating a corpus and creating a database in Hyperbase Web.
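
As an illustration of the kind of preprocessing covered in this lesson, here is a minimal sketch using only the Python standard library. The function names, the regular expression and the stop-word list are assumptions made for this example; they are not taken from the course material or from Hyperbase Web.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    An apostrophe inside a word (e.g. "didn't") is kept as part of the token.
    """
    return re.findall(r"\w+(?:'\w+)?", text.lower())

def preprocess(text, stopwords=frozenset()):
    """Tokenize the text and drop the stop words supplied by the caller."""
    return [tok for tok in tokenize(text) if tok not in stopwords]

# Hypothetical sample sentence and stop-word list, for illustration only.
sample = "The cat sat on the mat. The cat's owner didn't mind."
tokens = preprocess(sample, stopwords={"the", "on"})
counts = Counter(tokens)  # word frequencies of the cleaned text
```

A real corpus would usually need further steps (lemmatization, handling of punctuation and numbers, language-specific rules), which this sketch leaves out.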

Lesson 3: z-score and co-occurrence analysis
Course on the z-score applied to textual data analysis. Calculation of word distributions and co-occurrence-based word vectors.
Tutorials: Practical studies based on the corpus of each student.
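
The two calculations named in this lesson can be sketched as follows. The z-score below assumes the usual normal approximation of a binomial model (observed frequency in one part of the corpus against the frequency expected from the whole corpus); the exact formula used in the course or in Hyperbase Web may differ. Co-occurrence is counted here at sentence level, which is also an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def z_score(k, n, F, N):
    """z-score of a word seen k times in a part of n tokens,
    given F occurrences in a whole corpus of N tokens."""
    p = F / N                          # relative frequency in the whole corpus
    expected = n * p                   # expected count in the part
    std = math.sqrt(n * p * (1 - p))   # binomial standard deviation
    return (k - expected) / std

def cooccurrences(sentences):
    """Count unordered pairs of distinct words appearing in the same sentence."""
    pairs = Counter()
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

# A word found 30 times in a 1,000-token part, 100 times in a 10,000-token corpus.
z = z_score(k=30, n=1000, F=100, N=10000)
pairs = cooccurrences([["a", "b", "c"], ["a", "b"]])
```

A positive z-score indicates that the word is over-represented in the part, a negative one that it is under-represented. A row of the co-occurrence counts for a given word can serve as a simple co-occurrence-based word vector.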

Lesson 4: Multivariate statistics and clustering
Course on correspondence analysis and hierarchical classification. Supervised and unsupervised approaches to text classification.
Tutorials: Practical studies based on the corpus of each student.

Lesson 5: Deep learning for Natural Language Processing (NLP)
Introductory course on deep learning for NLP. Challenges, limits and expected added value.
Tutorials: Use of the Hyperbase Web platform with illustrative examples and exercises.

Lesson 6: Learning word embeddings
Course on word embeddings, from count vectors to Word2Vec. Study of the different types of word embeddings.
Tutorials: Word embedding implementation in Python (Count Vectors, TF-IDF, CBOW, SkipGram…). Comparison between statistics and deep learning.
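
The two count-based representations listed in the tutorial can be sketched in plain Python as below. This example assumes non-empty, already tokenized documents and the unsmoothed definition idf = log(N / df); libraries such as scikit-learn apply smoothing and normalization by default, so their values will differ. The function names are hypothetical, and CBOW and SkipGram are not covered here.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Return the sorted vocabulary and one raw count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        vectors.append([freq[w] for w in vocab])
    return vocab, vectors

def tf_idf(docs):
    """Weight relative term frequencies by log(N / document frequency)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Number of documents in which each vocabulary word appears.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    weighted = [
        [(c / len(doc)) * w for c, w in zip(row, idf)]
        for row, doc in zip(counts, docs)
    ]
    return vocab, weighted

# Two toy documents, for illustration only.
docs = [["cat", "sat"], ["cat", "ran"]]
vocab, counts = count_vectors(docs)
_, weights = tf_idf(docs)
```

In this toy example the word "cat" appears in every document, so its TF-IDF weight is zero, while "sat" and "ran" keep a positive weight in the document where they occur.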

Lesson 7: Convolutional neural network (CNN)
Course on convolutional models for text classification. Study of CNN hidden layers for linguistic feature extraction.
Tutorials: Programming CNN models for text classification. Analysis of the hidden convolutional layers for linguistic marker extraction.
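
To make the idea of reading linguistic markers from a convolutional layer concrete, here is a minimal pure-Python sketch of a single convolutional filter with ReLU and max-over-time pooling. It is a simplified stand-in written for illustration: it is not the TDS method and not the course's implementation. The function name and the toy values are assumptions, and the sequence is assumed to be at least as long as the filter.

```python
def conv1d_max_pool(embeddings, kernel):
    """Slide one filter over a token sequence and max-pool over time.

    embeddings: list of token vectors (seq_len x dim)
    kernel:     list of weight vectors (width x dim)
    Returns (max activation, start index of the window that produced it).
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        activations.append(max(0.0, total))  # ReLU
    best = max(range(len(activations)), key=activations.__getitem__)
    return activations[best], best

# Toy 2-dimensional embeddings for a 4-token sentence and a filter of width 2.
embeddings = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1]]
value, position = conv1d_max_pool(embeddings, kernel)
```

The returned position identifies the window of tokens that activated the filter most strongly, which is the kind of information one can inspect when looking for linguistic markers in a trained model.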

Lesson 8: Recurrent neural network (RNN)
Course on recurrent models for text classification. Study of attention layers for linguistic feature extraction.
Tutorials: Programming RNN models for text classification. Analysis of the attention layer for linguistic marker extraction.
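
The role of an attention layer can be sketched as follows, assuming a simple dot-product attention over the hidden states of a recurrent network. The hidden states and the query vector are toy values; in a real model both are learned. The function names are hypothetical, and the attention variant used in the course may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Return per-token attention weights and the weighted sum of the states."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * state[i] for w, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, context

# Toy hidden states for a 3-token sequence (one vector per token).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights, context = attention_pool(hidden_states, query)
```

The weights sum to one and give one value per token, so the tokens with the highest weights are natural candidates when looking for linguistic markers.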

Lesson 9: Go further with deep learning for text analysis
Overview of different architectures and tasks in computational linguistics: hybrid networks (CNN+RNN), GANs, text generation and question answering.
Tutorials: Programming a hybrid network for text classification. Practical studies based on the corpus of each student.

Lesson 10: Final exam
Practical case study. Starting from a given corpus, the student will answer a list of questions using the tools and methods learned during the course and/or available online on Hyperbase Web.

Evaluation

Tutorials count for 25% of the overall grade, based on the student's participation and answers to the exercises. The final exam counts for 75%.

References

Python and NLP: https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

Skansi, Sandro (2018), Introduction to Deep Learning - From Logical Calculus to Artificial Intelligence. Springer 2018

L. Vanni, M. Corneli, D. Longrée, D. Mayaffre, F. Precioso (2020) - "Key passages: from statistics to deep learning" In Text Analytics, Advances and Challenges - D. F. Iezzi, D. Mayaffre, M. Misuraca (Eds) - Springer 2020

Goyal P, Pandey S, Jain K (2018) Deep learning for natural language processing. Apress, Berkeley

L. Vanni, M. Ducoffe, D. Mayaffre, F. Precioso, D. Longrée, et al. (2018) - "Text Deconvolution Saliency (TDS): a deep toolbox for linguistic analysis" In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics - ACL 2018 [hal-01804310]

Hyperbase Web: http://hyperbase.unice.fr 

Teachers: Elena Cabrio, Serena Villata
Face-to-face hours: 30
Working hours: 60
ECTS: 6