Student project topics

1. Bots and Gender Profiling

https://pan.webis.de/clef19/pan19-web/author-profiling.html

Given a Twitter feed, determine whether its author is a bot or a human. If the author is human, identify their gender.

Introduction

Social media bots pose as humans to influence users for commercial, political or ideological purposes. For example, bots could artificially inflate the popularity of a product by promoting it and/or writing positive ratings, as well as undermine the reputation of competing products through negative ratings. The threat is even greater when the purpose is political or ideological (see the Brexit referendum or the US presidential elections). Fearing the effect of this influence, the German political parties have rejected the use of bots in their electoral campaign for the general elections. Furthermore, bots are commonly linked to the spread of fake news. Approaching the identification of bots from an author profiling perspective is therefore highly important for marketing, forensics and security.

Task

CLEF PAN addressed several aspects of author profiling in social media from 2013 to 2018 (age and gender, also together with personality, gender and language variety, and gender from a multimodality perspective). This year the aim is to investigate whether the author of a Twitter feed is a bot or a human and, if the author is human, to profile the author's gender.

As in previous years, the task is proposed from a multilingual perspective:

  • English
  • Spanish
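
The two-step decision described in the task (bot or human first, gender only for humans) can be sketched as a simple cascade. This is a minimal illustration, not the official evaluation setup: the function names, the label strings "bot" and "female", and both toy classifiers are assumptions made only to keep the example runnable. In a real project each stand-in would be replaced by a model trained per language.

```python
from typing import Callable, List


def profile_author(
    tweets: List[str],
    is_bot: Callable[[List[str]], bool],
    predict_gender: Callable[[List[str]], str],
) -> str:
    """Return 'bot' for bot feeds; otherwise the predicted gender of the human author."""
    if is_bot(tweets):
        return "bot"
    return predict_gender(tweets)


# Toy stand-ins (hypothetical, NOT real classifiers).
def toy_is_bot(tweets: List[str]) -> bool:
    # Flag a feed when at most half of its tweets are distinct.
    return len(set(tweets)) <= len(tweets) // 2


def toy_predict_gender(tweets: List[str]) -> str:
    # Placeholder that always returns the same label; a real model would be trained.
    return "female"


bot_feed = ["Buy now!", "Buy now!", "Buy now!", "Buy now!"]
human_feed = ["Off to work.", "Great match tonight!", "Coffee first.", "Rain again..."]

print(profile_author(bot_feed, toy_is_bot, toy_predict_gender))
print(profile_author(human_feed, toy_is_bot, toy_predict_gender))
```

Keeping the two classifiers as separate arguments makes it easy to swap in different models for English and Spanish without changing the cascade itself.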

 

2. Cross-domain Authorship Attribution

https://pan.webis.de/clef19/pan19-web/author-identification.html

Given a fanfiction text, determine its author among a list of candidates.

Introduction

Authorship attribution is an important problem in information retrieval and computational linguistics but also in applied areas such as law and journalism where knowing the author of a document (such as a ransom note) may enable e.g. law enforcement to save lives. The most common framework for testing candidate algorithms is the closed-set attribution task: given a sample of reference documents from a restricted and finite set of candidate authors, the task is to determine the most likely author of a previously unseen document of unknown authorship. This task may be quite challenging in cross-domain conditions, when documents of known and unknown authorship come from different domains (e.g., thematic area, genre). In addition, it is often more realistic to assume that the true author of a disputed document is not necessarily included in the list of candidates.

Fanfiction refers to fictional forms of literature which are nowadays produced by admirers (‘fans’) of a certain author (e.g. J.K. Rowling), novel (‘Pride and Prejudice’), TV series (Sherlock Holmes), etc. The fans borrow heavily from the original work’s theme, atmosphere, style, characters, story world etc. to produce new fictional literature, i.e. the so-called fanfics. This is why fanfiction is also known as transformative literature and has generated a number of controversies in recent years related to the intellectual property rights of the original authors (cf. plagiarism). Fanfiction, however, is typically produced by fans without any explicit commercial goals. The publication of fanfics typically happens online, on informal community platforms that are dedicated to making such literature accessible to a wider audience (e.g. fanfiction.net). The original work of art or genre is typically referred to as a fandom.

PAN focuses on cross-domain attribution in fanfiction, a task that can be more accurately described as cross-fandom attribution in fanfiction. In more detail, all documents of unknown authorship are fanfics of the same fandom (target fandom) while the documents of known authorship by the candidate authors are fanfics of several fandoms (other than the target-fandom). In contrast to the PAN-2018 edition of this task, we focus on open-set attribution conditions, namely the true author of a text in the target domain is not necessarily included in the list of candidate authors.

Task

Given a set of documents (known fanfics) by a small number (up to 10) of candidate authors, identify the authors of another set of documents (unknown fanfics) in another target domain. Each candidate author has contributed at least one of the unknown fanfics, which all belong to the same target fandom. Some of the fanfics in the target domain were not written by any of the candidate authors. The known fanfics belong to several fandoms (excluding the target fandom), although not necessarily the same for all candidate authors. An equal number of known fanfics per candidate author is provided. In contrast, the unknown fanfics are not equally distributed over the authors. The text length of the fanfics varies from 500 to 1,000 tokens. All documents of a problem are in the same language, which may be English, French, Italian, or Spanish.
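
One possible starting point for the open-set setting is sketched below: build a character trigram profile per candidate author, compare the unknown text to each profile by cosine similarity, and answer "none of the candidates" when even the best match is weak. This is only an illustrative baseline under stated assumptions, not the method prescribed by the task: the `<UNK>` label, the trigram size, and the threshold of 0.5 are all hypothetical choices that would need tuning on real data.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List

UNKNOWN = "<UNK>"  # hypothetical label for "written by none of the candidates"


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def attribute(unknown: str, known: Dict[str, List[str]], threshold: float = 0.5) -> str:
    """Return the most similar candidate author, or UNKNOWN if the best score is too low."""
    target = char_ngrams(unknown)
    scores = {}
    for author, docs in known.items():
        profile = Counter()
        for doc in docs:  # merge all known documents of one author into one profile
            profile.update(char_ngrams(doc))
        scores[author] = cosine(target, profile)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else UNKNOWN


known_fanfics = {
    "A": ["the quick brown fox jumps over the lazy dog"],
    "B": ["zzzz yyyy xxxx wwww vvvv"],
}
print(attribute("the quick brown fox", known_fanfics))
print(attribute("1234567890 1234567890", known_fanfics))
```

Character n-grams are used here because they tend to capture style rather than topic, which matters when known and unknown texts come from different fandoms. The fixed threshold is the weakest part of the sketch and is the first thing to replace in a real system.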

3. Celebrity Profiling

https://pan.webis.de/clef19/pan19-web/celebrity-profiling.html

Given the English Social Media feed of a celebrity, determine their degree of fame, occupation, age, and gender.

Introduction

Celebrities are among the most prolific users of social media, promoting their personas and rallying followers. This activity is closely tied to genuine writing samples, rendering them worthy research subjects in many respects, not least author profiling.

Task

The Celebrity Profiling task aims to predict four traits of a celebrity from their social media communication. The traits are the degree of fame, occupation, age, and gender. The social media communication is given as the teaser messages from past tweets. The goal is to develop a piece of software which predicts celebrity traits from the teaser history.

4. Suggestion Mining from Online Reviews and Forums

https://competitions.codalab.org/competitions/19955

Classify given sentences into suggestion and non-suggestion classes.

Introduction

Suggestion mining can be defined as the extraction of suggestions from unstructured text, where the term ‘suggestions’ refers to the expressions of tips, advice, recommendations etc. Consumer opinions towards commercial entities like brands, services, and products are generally expressed through online reviews, blogs, discussion forums, or social media platforms. These opinions largely express positive and negative sentiments towards a given entity, but also tend to contain suggestions for improving the entity or advice to fellow consumers. Traditional opinion mining systems mainly focus on calculating the sentiment distribution by means of Sentiment Analysis methods. A suggestion mining component can therefore extend the capabilities of traditional opinion mining systems, leading to a number of new use cases. Such systems can empower both public and private sectors by extracting suggestions which are spontaneously expressed on various online platforms, enabling organisations to collect suggestions from much larger and more varied sources.

Suggestion mining remains a relatively young area compared to Sentiment Analysis, especially in the context of recent advancements in neural network based approaches for learning feature representations. This research could drive the engagement of both commercial entities and the research communities working on problems like opinion mining, supervised learning, representation learning, etc. From a linguistic viewpoint, suggestion mining includes extra-propositional aspects like mood and modality, sarcasm, compound sentences, etc. It is observed that in some cases the grammatical properties of a sentence alone can decide its label, while at other times semantics can play a significant role.

Task

The suggestion mining task comprises two subtasks:

  1. Under this subtask, students will perform domain-specific suggestion mining, where the test dataset will belong to the same domain as the training and development datasets, i.e. a suggestion forum for Windows platform developers.
  2. Under this subtask, students will perform cross-domain suggestion mining, where the train/development and test datasets will belong to separate domains. The train and development datasets will remain the same as in subtask 1, while the test dataset will belong to a specific topic.
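
Since the grammatical form of a sentence can sometimes decide its label on its own, a rule-based baseline is a reasonable first experiment for both subtasks before trying learned models. The sketch below marks a sentence as a suggestion when it contains a modal or advice cue. The cue list is a hypothetical hand-picked illustration, not derived from the task data, and it will produce false positives and misses on real text.

```python
import re
from typing import List

# Hypothetical cue patterns chosen for illustration only.
CUE_PATTERNS = [
    r"\b(should|would be (nice|great|good)|needs? to|ought to)\b",
    r"\b(please|recommend|suggest|consider)\b",
    r"\b(i wish|why not|how about)\b",
]
CUE_REGEX = re.compile("|".join(CUE_PATTERNS), flags=re.IGNORECASE)


def is_suggestion(sentence: str) -> bool:
    """Label a sentence as a suggestion if any cue pattern occurs in it."""
    return CUE_REGEX.search(sentence) is not None


def classify(sentences: List[str]) -> List[int]:
    """Return 1 for suggestion and 0 for non-suggestion, one label per sentence."""
    return [int(is_suggestion(s)) for s in sentences]


examples = [
    "Please add support for dark mode.",
    "The app should remember my last login.",
    "The update crashed my phone twice.",
    "I love the new tile layout.",
]
print(classify(examples))
```

Because the rules do not depend on domain vocabulary, the same baseline can be run unchanged on the cross-domain test set, which gives a useful reference point for judging how well a trained model transfers.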