Document Forensics

Analysing large numbers of documents is a common and time-consuming task. For instance, investigating unsavoury business practices (e.g., slavery, fraud, bribery) can involve processing large numbers of contracts, yearly reports and external (news) sources that may reflect on a company’s reputation and relations. Currently, this is a labour-intensive task that mainly relies on text search to identify relevant documents, which are then processed manually.
In this project we will apply methods to extract the relevant concepts (e.g., the names of suppliers, the type of relationship between companies, or executive management) from unstructured (e.g., news) as well as semi-structured (e.g., contracts and financial) documents to populate knowledge graphs and link them to publicly available knowledge graphs. These knowledge graphs should reflect the temporal binding and provenance of the extracted relations and properties. This will enable automated reasoning about companies and their relationships, such as ownership structures or supply chains and their dynamics. It will also allow investigations and due diligence processes to leverage external news sources as well as document collections such as the Panama Papers to automatically identify suspicious entities that companies interact with, even if this interaction is indirect.
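As an illustration of what "temporal binding and provenance" could mean in practice, the following is a minimal sketch and not the project's actual data model. The class `Relation`, the function `find_path`, and all company names and source file names are hypothetical. Each extracted relation carries a validity period and the document it came from, so that an indirect link between two entities can be traced for a given date.

```python
from collections import deque
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """One extracted relation with its validity period and provenance (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date]  # None means the relation has no known end date
    source: str               # document the relation was extracted from

    def holds_on(self, day: date) -> bool:
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


def find_path(relations, start, target, day):
    """Breadth-first search over the directed relations valid on `day`.

    Returns the list of relations linking `start` to `target`, or None.
    """
    graph = {}
    for rel in relations:
        if rel.holds_on(day):
            graph.setdefault(rel.subject, []).append(rel)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel in graph.get(node, []):
            if rel.obj not in seen:
                seen.add(rel.obj)
                queue.append((rel.obj, path + [rel]))
    return None


# Invented example data, for illustration only.
relations = [
    Relation("AcmeCorp", "supplied_by", "BetaTextiles",
             date(2015, 1, 1), None, "contract_017.pdf"),
    Relation("BetaTextiles", "owned_by", "ShellCo",
             date(2016, 6, 1), date(2018, 12, 31), "news_2017_03.txt"),
]

path_2017 = find_path(relations, "AcmeCorp", "ShellCo", date(2017, 5, 1))
path_2020 = find_path(relations, "AcmeCorp", "ShellCo", date(2020, 5, 1))
```

In this toy data the indirect link between AcmeCorp and ShellCo exists in 2017 but not in 2020, and every step of the path can be traced back to its source document.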
At the outset of the project, we will construct a small corpus of relevant questions for document forensics tasks, together with hand-crafted gold standard answers as a benchmark for project success. The questions will be grouped in sets of increasing difficulty: answerable over a single document, answerable over multiple documents, answerable only with background knowledge.
If successful, this project will open up possibilities to guarantee fair trade practices and substantially reduce the effort to comply with regulations that aim at combating money laundering, financing terrorism, bribery, etc.

Building task-specific sentiment analysis models: an evaluation of the active learning approach

As automatic text analysis has become an established methodological field in the humanities and social sciences, one of the most sought-after techniques is the automatic extraction of attitudes, emotions, judgments and opinions. Under the banner of sentiment analysis or opinion mining, these techniques have been widely used in scientific research as well as professional applications. Since sentiment can be defined and operationalized in multiple ways, and the expression of sentiment can differ greatly across domains, there is no single, universal sentiment analysis tool. Rather, dictionaries and models need to be tuned for specific use cases.
In this project we investigate the potential of a semi-supervised approach called active learning as a fast and powerful way to train customized, task-specific sentiment analysis models. The essence of active learning is that a human annotator interactively trains a machine learning model. An algorithm provides the annotator with the texts that are most relevant for improving the model, which greatly reduces the number of texts that require coding and thus enables researchers themselves to supervise the training process.
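The selection step can be sketched as follows. The text does not specify how the "most relevant" texts are chosen; this sketch assumes one common strategy, uncertainty sampling, in which the annotator is shown the texts about which the current model is least certain. The function name `select_most_uncertain` and the scores are hypothetical.

```python
def select_most_uncertain(probabilities, batch_size):
    """Return the ids of the texts whose predicted positive-class
    probability is closest to 0.5, i.e. where the model is least certain.

    `probabilities` maps a text id to the model's current probability
    that the text expresses positive sentiment.
    """
    ranked = sorted(probabilities, key=lambda text_id: abs(probabilities[text_id] - 0.5))
    return ranked[:batch_size]


# Invented model scores for five unlabelled texts.
scores = {"t1": 0.97, "t2": 0.52, "t3": 0.08, "t4": 0.41, "t5": 0.66}
to_annotate = select_most_uncertain(scores, 2)
```

In a full loop, the annotator would label the selected texts, the model would be retrained on the enlarged labelled set, and the selection would be repeated on the remaining texts.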
Previous studies show promising results, but focus mostly on document-level sentiment scores, often in short social media messages. In this project we investigate the application to journalistic texts, incorporating the holder and target of sentiment. We evaluate whether active learning enables us to train new models (RQ1) and retrain existing models (RQ2) for better performance on specific sentiment attribution tasks, using the Prodigy annotation tool. Using two corpora (on terrorism and vaccinations, respectively), we develop two separate models for performing the same task in different domains. Additionally, two gold standard sets will be annotated independently of the active learning annotation process to detect possible bias caused by this particular approach. Based on these analyses, we discuss the potential applications of active learning for sentiment analysis.

Postdoctoral Researchers: their (in)formal networks and further Career structure.

This project, ‘Postdoctoral Researchers: their (in)formal networks and further Career structure’, aims at investigating the current and future careers of postdoctoral researchers (postdocs) in the Netherlands, in relation to their social, professional and organisational networks. We attempt to establish a new, critical perspective in studying networks and employment relations in and outside academia, given that existing theories in this area fail to do justice to the unique and complex situation of postdocs (Goastellec et al., 2013; Teelken & Van der Weijden, 2018). There is an emerging literature on the role of networks in people’s careers in professional organizations (Kokot, 2014), but not in the case of junior academics, or more specifically postdocs.
The employment situation of postdocs within academia is unique because they are highly educated and motivated, closely involved with and contributing to the primary process of academia (Häyrinen-Alestalo & Peltola 2006). However, they generally lack a longer-term perspective and tenured contracts (Van der Weijden et al., 2015). Outside academia their opportunities for employment are equally difficult (Zollner, 2016). Especially striking is the contrast between the postdocs’ weak institutional employment relations and the strength of their own personal initiatives through their (in)formal networks.
In order to investigate the mutual interaction between (in)formal networks and organisational structures (Moser, 2013), we will use the model proposed by Gläser and Laudel (2015). They explain the peculiarities of academic careers in contrast with general career research by distinguishing three different types of careers through which academics move simultaneously:

  1. The Cognitive Career, which refers to the content of their work.
  2. The Community Career, which relates to their (in)formal networks.
  3. The Organisational Career, which typically consists of a sequence of jobs.

Accordingly, our research question is: How and to what extent do postdocs shape their career development, inside or outside academia, by using their (in)formal networks?

Heattweet: Exploring the link between weather and aggression on social media

Recently, meteorological conditions (e.g., temperature) have been linked to expressed sentiment on social media (Baylis et al., 2018). In this project we focus on the influence of meteorological conditions on expressions of interpersonal and intergroup aggression in social media messages, and on a possible explanatory mechanism, i.e. strength of future orientation. Given the importance of social media in interpersonal and intergroup communication nowadays, expressions of aggression in social media messages may threaten societies’ interconnectedness and inclusiveness.
According to the model for CLimate, Aggression, and Self-control (CLASH; Van Lange, Rinderu, & Bushman, 2017), higher temperatures may increase aggression because they result in a weaker future orientation, which is linked to lower levels of self-control (e.g., Baumeister et al., 1994). However, some psychological experiments suggest that higher temperatures may actually inhibit aggression and promote prosocial behavior by enhancing relational mindsets (e.g., IJzerman & Semin, 2009) and affiliative motivation (e.g., Fay & Maner, 2012). To complicate matters further, other research suggests a curvilinear relationship between temperature and aggression (Van de Vliert et al., 1999).
In the current project, we will explore the link between the daily temperature and other meteorological conditions in the Netherlands (data obtained from KNMI), and expressions of interpersonal and intergroup aggression extracted from social media data (provided by Coosto). Proxies for aggression include terms of abuse (i.e., swear words), and words specifically related to, e.g., racist discourse (e.g., Tulkens, 2016), hate speech, and cyberbullying (e.g., Del Vigna et al., 2017). In addition to existing word lists, dictionaries will be composed semi-automatically, using wordnet propagation, corpus comparison, and pattern extraction (Baccianella et al., 2010; Maks et al., 2014). Degree of future orientation will be assessed by detecting the use of temporal references (e.g., tomorrow, next week; see Basic et al., 2018), and subsequently tested as an explanatory mechanism.
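A dictionary-based proxy of this kind can be sketched as below. This is only an illustration under stated assumptions: the two word lists are tiny, hypothetical English stand-ins (the project concerns Dutch social media data and its own dictionaries), and the helper names `count_terms` and `share_with_terms` are invented here.

```python
import re

# Hypothetical stand-in word lists; the project's actual dictionaries are not given in the text.
FUTURE_TERMS = ("tomorrow", "next week", "next month")
ABUSE_TERMS = ("idiot", "moron")


def count_terms(message, terms):
    """Count whole-word, case-insensitive occurrences of the terms in one message."""
    text = message.lower()
    return sum(len(re.findall(r"\b" + re.escape(term) + r"\b", text)) for term in terms)


def share_with_terms(messages, terms):
    """Fraction of messages that contain at least one of the terms."""
    if not messages:
        return 0.0
    hits = sum(1 for message in messages if count_terms(message, terms) > 0)
    return hits / len(messages)


# Invented messages for a single day.
messages = [
    "See you tomorrow!",
    "You idiot, I will fix it next week",
    "Nice weather today",
]
future_share = share_with_terms(messages, FUTURE_TERMS)
abuse_share = share_with_terms(messages, ABUSE_TERMS)
```

Computed per day, such shares could then be related to the daily temperature; how the project performs that analysis is not specified in the text.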

Linked Art Provenance

The goal of the Linked Art Provenance project is to support art-historical provenance research with methods that automatically integrate information from heterogeneous sources. Art provenance concerns the changes in ownership of artworks over time, involving actors, events and places. It is an important source of information for researchers interested in the history of collections. An example of an art-historical research question where provenance information is indispensable is:
“Can we identify all the paintings from the collection of Pieter Cornelis Baron van Leyden (1717-88), which was dispersed at an auction in the 19th century?”
The auction of the 117 paintings has been recorded in a catalogue [1], which can serve as a basis for recording a provenance trail for each of the paintings. The problem is that paintings back then did not have unambiguous identifiers: identification relied on textual descriptions. Currently, based on this textual description, a researcher has to manually search for sources of subsequent provenance transactions. We aim to automate this process by matching the textual descriptions of objects with new sources of art provenance information, such as databases, websites and digitized auction catalogues. To do so, relevant sources of provenance information have to be identified and the retrieved information has to be normalized. Therefore, we will create data harmonization pipelines for different types of sources. A pipeline will consist of the following steps:

  1. Query formulation – transform the textual description of a painting into an appropriate query
  2. Data retrieval – retrieve data from the source
  3. Data conversion – convert the data into a standardized data model
  4. Entity linking – identify entities (e.g. actors, places) and link them to structured vocabularies
  5. Candidate event formulation – formulate candidate provenance events

The candidate events are evaluated by the provenance researcher, thereby giving the researcher full control over the resulting provenance trail.
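The five steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not the project's pipeline: the in-memory `SOURCE`, the `VOCABULARY` with its example.org URI, the keyword-overlap matching and all function names are hypothetical, and the records are invented.

```python
import re
from dataclasses import dataclass
from typing import Optional

STOPWORDS = {"a", "an", "the", "of", "with", "in", "and"}

# Hypothetical structured vocabulary and provenance source, invented for this sketch.
VOCABULARY = {"j. smith": "https://example.org/actor/1"}
SOURCE = [
    {"title": "Landscape with a windmill and cattle", "buyer": "J. Smith", "sold": "1850"},
    {"title": "Portrait of a young woman", "buyer": "A. Jansen", "sold": "1860"},
]


@dataclass
class CandidateEvent:
    description: str
    new_owner: str
    year: int
    owner_uri: Optional[str]  # None when the actor could not be linked


def formulate_query(description):
    """Step 1: reduce a textual description to a set of keywords."""
    words = re.findall(r"[a-z]+", description.lower())
    return {word for word in words if word not in STOPWORDS}


def retrieve(query, source, min_overlap=2):
    """Step 2: keep records whose title shares enough keywords with the query."""
    return [r for r in source if len(query & formulate_query(r["title"])) >= min_overlap]


def convert(record):
    """Step 3: map a source-specific record onto one common set of fields."""
    return {"description": record["title"],
            "new_owner": record["buyer"].strip(),
            "year": int(record["sold"])}


def link_entities(record, vocabulary):
    """Step 4: attach a vocabulary URI to the actor, if one is known."""
    return {**record, "owner_uri": vocabulary.get(record["new_owner"].lower())}


def formulate_events(records):
    """Step 5: wrap each linked record as a candidate provenance event."""
    return [CandidateEvent(**record) for record in records]


def run_pipeline(description):
    query = formulate_query(description)
    converted = [convert(r) for r in retrieve(query, SOURCE)]
    linked = [link_entities(r, VOCABULARY) for r in converted]
    return formulate_events(linked)


events = run_pipeline("A landscape with cattle near a windmill")
```

The output is deliberately a list of candidates rather than a final answer, so that the provenance researcher can accept or reject each event.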

Linguistic Variation in Classical Hebrew: from Markov Models to Neural Networks

This project is a continuation of the AA 2017 project “Probabilistic Approach to Linguistic Variation and Change in Biblical Hebrew”, which investigated whether the so-called Standard Biblical Hebrew [SBH] books and Late Biblical Hebrew [LBH] books exhibit enough internal consistency to confirm the traditional divisions into an SBH and an LBH corpus. With a Markov Model [MM] of the clause, phrase, and part-of-speech tendencies for each book in the Hebrew Bible, distances between books were measured in order to cluster them based on similarity.
The results partly confirmed the scholarly consensus (e.g., the internal consistency of SBH) and partly corrected it (e.g., LBH is much more heterogeneous). In addition, both the potential and the limitations of MMs became increasingly visible. MMs predict the next state based only on the current state. However, in linguistic utterances, previous states affect future states (e.g., in the sequence [Subject] [Verb] [X], the presence of [Subject] before [Verb] rules out that state [X] is [Subject]).
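The limitation can be made concrete with a toy example. The sketch below is not the previous project's model: the four clauses are invented, and the function names are hypothetical. It estimates first-order transition probabilities and compares them with an estimate that also conditions on the state before the current one.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order probabilities P(next state | current state)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
            for state, followers in counts.items()}


def second_order_probability(sequences, previous, current, nxt):
    """Estimate P(next state | previous state, current state) by counting."""
    context = matches = 0
    for seq in sequences:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            if (a, b) == (previous, current):
                context += 1
                matches += (c == nxt)
    return matches / context if context else 0.0


# Invented toy clauses, each a sequence of phrase functions.
clauses = [
    ["Subject", "Verb", "Object"],
    ["Verb", "Subject", "Object"],
    ["Subject", "Verb", "Object"],
    ["Verb", "Subject"],
]

probs = transition_probabilities(clauses)
p_first_order = probs["Verb"].get("Subject", 0.0)
p_with_history = second_order_probability(clauses, "Subject", "Verb", "Subject")
```

On this toy data the first-order model gives [Subject] a probability of 0.5 after [Verb], regardless of what came before, whereas conditioning on the preceding [Subject] gives 0.0. A model that takes earlier states into account can therefore capture the dependency that the first-order model misses.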
The new project will build upon the previous project by applying more complex probabilistic models that are able to do justice to the sequential structure of natural language: Recurrent Neural Networks (RNNs). RNNs have the ability to output the next state depending not only on the current state, but also on the previous states. Hence, structural dependencies in a sentence can be better modeled than by using MMs. Therefore, the use of an RNN for this problem is a natural next step in better understanding the linguistic variation. However, since RNNs have a more complex structure, the results of the model are harder to interpret and to explain. The results of the previous project can help in understanding the output of the RNN. We expect that we will obtain a better model that will give more precise results as compared to the MMs.

The semantics of meaning: distributional approaches for studying philosophical text

Concepts such as schizophrenia, marriage or fact change through time. In philosophy, these changes are studied in a small number of scientific or scholarly texts at a time through very precise, subtle, manual analyses (close reading). In computational linguistics, the changes in question are studied in massive, generic corpora such as the whole of Wikipedia, by computational methods largely based on so-called ‘word embeddings’: representations of word meaning as vectors in a semantic space, based purely on the words that surround a given word. The current challenge in philosophy is to obtain fine-grained analyses at a bigger scale (Betti & van den Berg 2016). The current challenge in computational linguistics is to detect non-trivial shifts of meaning, while increasing reliability by a firm methodological grasp of the real factors influencing the results (Hellrich & Hahn 2016).
In this project philosophers and computational linguists conduct an interdisciplinary pilot study with the aim of combining the strengths of both fields. We will rely on a test case from a corpus comprising the writings of the American philosopher W. V. Quine. The corpus is small from a computational linguistics point of view, but rather big from a philosophical point of view. The philosophers will provide a dataset, a test case and an evaluation set centering around subtle shifts in a number of concepts (such as science, fact, intuition). The computational linguists will adapt word embedding models to this type of text, along the lines of nonce2vec (Herbelot and Baroni 2017), which was designed to learn embeddings from tiny data. The focus of the project will be methodological. The project will be considered successful if, next to a software release, an adequate evaluation method for this type of data and this type of interdisciplinary project has been developed by the end of the project.
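The basic idea of representing a word purely by its surrounding words can be sketched with simple co-occurrence counts. This is a deliberately simplified stand-in, not nonce2vec and not the project's method. The sentences are invented (they are not quotations from Quine), and the function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` positions of `target`."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                vector.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Invented sentences standing in for two sub-corpora (e.g., earlier and later writings).
early = ["science rests on observation"]
late = ["theory shapes science"]

early_vector = context_vector("science", early)
late_vector = context_vector("science", late)
similarity = cosine(early_vector, late_vector)
```

A low similarity between the vectors of the same word in two sub-corpora would hint at a difference in usage. With data this small such a signal is unreliable, which is part of why methods designed for tiny data and a careful evaluation method are the focus of the project.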
