Ilse Ras reports on her research on British newspapers

My role in the project is to collect and analyse a corpus of British newspaper articles, published between 2000 and 2016, on the topic of human trafficking.
A corpus is nothing more than a collection of texts, and corpus linguistics is the field that uses corpora (the plural of corpus) to draw conclusions about language use.

Collecting corpora is not always a straightforward endeavour, particularly when the topic under investigation is sometimes misunderstood by the public, the media, and even legislators. For instance, human trafficking may also be known as ‘modern slavery’, and may encompass such crimes as organ harvesting, forced labour, and domestic servitude. Furthermore, there is an ongoing debate about whether sex work should always be considered a form of exploitation and trafficking. Finally, it may not always be clear whether someone has been trafficked (i.e. moved using coercion or deception for the purposes of exploitation) or smuggled (i.e. voluntarily but irregularly moved across borders), particularly as those who volunteer, or more accurately pay, to be smuggled across borders are also often deceived and exploited, both along the way and at the end point.

I used the Lexis Nexis database, mainly for practical reasons: it holds a great number of British newspaper articles, I’ve used it before, and its output is compatible with a special Python script that Chris Norton (University of Leeds) wrote for me a few years ago, which separates articles and organises them into a useful folder structure.
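
Chris’s script isn’t public, but the splitting step can be sketched. The version below assumes a plain-text Lexis Nexis export in which each article is preceded by a header line of the form ‘3 of 150 DOCUMENTS’, and that the first non-empty line of each article names the source newspaper; the file names and folder layout are purely illustrative.

```python
import re
from pathlib import Path

# Assumption: a plain-text Lexis Nexis export in which each article is
# preceded by a header line such as "3 of 150 DOCUMENTS".
DOC_HEADER = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$", re.MULTILINE)

def split_export(export_file: str, out_dir: str) -> None:
    """Split one export file into per-article text files, one folder per source."""
    text = Path(export_file).read_text(encoding="utf-8", errors="replace")
    articles = [a.strip() for a in DOC_HEADER.split(text) if a.strip()]
    for i, article in enumerate(articles, start=1):
        # Illustrative assumption: the first non-empty line names the newspaper.
        source = article.splitlines()[0].strip() or "unknown"
        folder = Path(out_dir) / re.sub(r"\W+", "_", source)
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"article_{i:05d}.txt").write_text(article, encoding="utf-8")

if __name__ == "__main__":
    split_export("lexisnexis_export.txt", "corpus")
```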

One way of collecting articles is to go into Lexis Nexis, input some very broad but relevant search terms, and manually select all articles that actually discuss the topic under investigation. This is what I did for my PhD research. However, it is an extremely time-consuming method, as it requires wading through several million articles in order to select perhaps 50,000–90,000.
I only have three months in which to complete my part of the project, and it would be a waste of time to spend all three months just collecting data.
But that’s not necessary. Gabrielatos, then at Lancaster University, published a paper in 2007 outlining the data collection method used to create a corpus for the RASIM project (more information here: http://ucrel.lancs.ac.uk/projects/rasim/). Gabrielatos’ (2007) method entails collecting a sample corpus using two ‘core’ search terms; generating a list of possible additional search terms from this sample corpus; testing these possible additional search terms; and then using those that pass the test (as well as the core search terms) to collect the full corpus. It’s a far less time-consuming method, and although there is a slightly increased risk of collecting articles that aren’t strictly relevant, it is also more systematic, and therefore more replicable, than manually selecting articles.

So that’s what I did. I first created three sample corpora:
1. One at the start of the period (1/1/00-30/9/00)
2. One at the end of the period (1/1/16-30/9/16)
3. One in the middle of the period (1/1/08-30/9/08)
I used a handful of core search terms, rather than two. These included ‘slavery’, ‘forced labour’, ‘sexual exploitation’, and ‘human trafficking’.

Of these search terms, ‘slavery’ produced the highest number of articles that did not also mention any of the other search terms, presumably because historical slavery is often still considered distinct from modern slavery (a perception that is itself worthy of examination). This search term therefore sets the threshold against which all other potential search terms are tested.
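
The post doesn’t spell out the precise pass criterion, but one plausible reading is a simple count over the sample corpus: tally the articles a term retrieves that mention none of the other search terms, and let a candidate pass if its tally at least matches that of ‘slavery’. A minimal sketch under that assumption, with articles represented as lowercased strings:

```python
CORE_TERMS = ["slavery", "forced labour", "sexual exploitation", "human trafficking"]

def unique_hits(term: str, others: list[str], articles: list[str]) -> int:
    """Count articles containing `term` but none of the terms in `others`."""
    return sum(
        1 for text in articles
        if term in text and not any(other in text for other in others)
    )

def passes_threshold(candidate: str, articles: list[str]) -> bool:
    """Assumed criterion: a candidate passes if it retrieves at least as many
    articles mentioning no core term as 'slavery' retrieves that mention no
    other core term. The exact test may differ from this reading."""
    others = [t for t in CORE_TERMS if t != "slavery"]
    threshold = unique_hits("slavery", others, articles)
    return unique_hits(candidate, CORE_TERMS, articles) >= threshold

# e.g. passes_threshold("domestic servitude", sample_corpus)
```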

I used these three sample corpora to create key word lists. A key word list shows which words are used much more often in the sample corpus than in another corpus, the reference corpus. I took rank 60 as the cut-off point for selecting additional search terms from these key word lists, so only the top 60 words of every list were tested against the threshold set by ‘slavery’. I also asked my co-investigators to send me lists of words that they thought could be useful search terms, and tested those, too.
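
The post doesn’t name the keyness statistic used, but log-likelihood is a standard choice in corpus linguistics (it is, for instance, offered by tools such as WordSmith and AntConc). A sketch of ranking the top 60 keywords that way, taking already-tokenised corpora as plain word lists (the tokenisation itself is left out):

```python
import math
from collections import Counter

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Standard keyness log-likelihood: a and b are a word's frequencies in the
    study and reference corpora; c and d are the corpora's total token counts."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def top_keywords(study: list[str], reference: list[str], n: int = 60):
    """Rank words that are overused in the study corpus relative to the reference."""
    sf, rf = Counter(study), Counter(reference)
    c, d = sum(sf.values()), sum(rf.values())
    scores = {
        w: log_likelihood(freq, rf.get(w, 0), c, d)
        for w, freq in sf.items()
        if freq / c > rf.get(w, 0) / d  # keep positive (overused) keywords only
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```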

Eventually, I used the core search terms, plus the additional search terms that passed the threshold test, to collect the full corpus, which currently consists of slightly over 80,000 articles. Chris’s Python script initially only recognised articles published by seven of the major British newspapers (which is what it was originally intended to do), so I adapted it to recognise the other news sources that we wanted to include in this project.

The next step is to actually conduct analyses of this corpus, and I will be back to update you on that in December.
– Ilse Ras

References:
Gabrielatos, C. 2007. Selecting query terms to build a specialised corpus from a restricted-access database. ICAME Journal 31, pp. 5-43. Available at: http://clu.uni.no/icame/ij31/ij31-page5-44.pdf
