Automatic construction of a standard web corpus with noise minimization for NLP

Organized by LIASD.

Over the years, the role of the public has changed from simple consumer to producer of information. In particular, new Web 2.0 applications known as Rich Internet Applications (discussion forums, blogs, wikis, etc.) let users easily add personalized content. The relationships between individuals are so heterogeneous that no collective view of their behavior can be formed, so behavioral prediction must be done per individual.
Given the large number of individuals active on forums, their interactions produce an immense amount of data. This raises a scaling problem: how can dimensions be extracted separately, based on the structure that makes up a web document, so that only the relevant information is kept?
Web 2.0 platforms are a set of quick and effective tools for collaboration, mutual help, and information sharing between users in the form of virtual groups (wikis, blogs, forums, emails, and social networks). Forums, for example, are one of the main sources of rich, open, constantly evolving data. They are organized around asynchronous discussions in which users post messages on a subject, and they generally follow one of several models: question-answer, discussion around an idea, discussions initiated by each user, standard general-purpose forums, or blog-like forums.
Researchers in academia and industry, especially linguists and companies, need powerful tools to adapt to this technological revolution. The key to web data extraction is to locate the relevant information in web text at the collection step, when it is taken from HTML pages (forums, wikis, textual documents, product descriptions, etc.). Manual extraction is hard to sustain because of the time and resources it requires, given how unevenly HTML pages are structured. We will review the techniques and methods used in the web data extraction field, then present our approach for discovering and discarding parasitic (noise) data so that the essential information in a web page can be extracted to build a standard web corpus. Such a corpus serves as the main subject of analytical, statistical, and linguistic studies in NLP, from which knowledge can then be extracted with reasonable quality and in reasonable time.
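To make the notion of noise concrete, here is a minimal sketch of tag-based noise filtering on a forum-like HTML page. It is not the approach presented in the seminar, which the abstract does not detail. The list of noise tags, the class name MainTextExtractor and the function name extract_main_text are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Assumption for this sketch: text inside these tags is treated as noise
# (navigation, scripts, styling, page furniture), everything else as content.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise tags are currently open
        self.chunks = []      # text fragments kept as content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the list of text fragments found outside noise tags."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

A fixed tag list like this only catches the most obvious noise. Forum pages with uneven structure, as described above, would need structure-aware or statistical criteria on top of it.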

Practical information

Thursday, December 15, 2016
Université Paris 8, UFR MITSIC, Room A148
Time: 2:00 pm to 3:30 pm

Contact
Otmane Manad
PhD student - LIASD laboratory

Past events

December 15, 2016

Seminar organized by LIASD.

Venue: Université Paris 8 - Building A, room A148 - 2:00 pm to 3:30 pm