My role in the project is to collect
and analyse a corpus of British newspaper articles, published between 2000 and
2016, on the topic of human trafficking.
A corpus is nothing more than a
collection of texts, and corpus linguistics is the field that uses corpora (the
plural of corpus) to draw conclusions about language use.
Collecting corpora is not always a
straightforward endeavour, in particular when the topic under investigation is
sometimes misunderstood by the public, the media, and even legislators. For
instance, human trafficking may also be known as ‘modern slavery’, and may
encompass crimes such as organ harvesting, forced labour, and domestic
servitude. Furthermore, there is an ongoing debate about whether sex work
should always be considered a form of exploitation and trafficking. Finally, it
may not always be clear whether someone has been trafficked (i.e. moved using
coercion or deception for the purposes of exploitation), or smuggled (i.e.
voluntarily but irregularly moved across borders), in particular as those who
volunteer, or more accurately pay, to be smuggled across borders are also often
deceived and exploited, both along the way and at the end point.
I used the Lexis Nexis database, mainly
for practical reasons: it has a great number of British newspaper articles,
I’ve used it before, and its output is compatible with a Python script
that Chris Norton (University of Leeds) wrote for me a few years ago, which
separates articles and organises them into a useful folder structure.
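I can’t reproduce Chris’s script here, but the general idea can be sketched in a few lines of Python. Everything in the sketch is illustrative: the ‘DOCUMENTS’ header pattern and the folder layout are assumptions about the export format, not the actual script.

```python
"""A minimal sketch of a Lexis Nexis article splitter.

This is NOT Chris Norton's script; the header regex and the folder
layout are assumptions made purely for illustration.
"""
import re
from pathlib import Path

# Assume each bulk download concatenates many articles, and each article
# begins with a header line such as "12 of 500 DOCUMENTS".
DOC_HEADER = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$", re.MULTILINE)

def split_articles(raw_text: str) -> list[str]:
    """Split one bulk download into individual article texts."""
    parts = DOC_HEADER.split(raw_text)
    # Anything before the first header is boilerplate, so drop it.
    return [part.strip() for part in parts[1:] if part.strip()]

def save_articles(bulk_file: Path, out_dir: Path) -> None:
    """Write each article to its own numbered file under out_dir."""
    articles = split_articles(bulk_file.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, article in enumerate(articles, start=1):
        (out_dir / f"{bulk_file.stem}_{i:04d}.txt").write_text(
            article, encoding="utf-8")

if __name__ == "__main__":
    # Split every downloaded bulk file into one folder per download.
    for bulk in Path("downloads").glob("*.txt"):
        save_articles(bulk, Path("corpus") / bulk.stem)
```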
One way of collecting articles is to go
into Lexis Nexis, input some very broad but relevant search terms, and manually
select all articles that actually discuss the topic under investigation. This
is what I did for my PhD research. However, this is an extremely time-consuming
method, as it requires wading through several million articles in order to
select perhaps 50 to 90 thousand.
I only have three months in which to
complete my part of the project, and it would be a waste of time to spend all
three months just collecting data.
But that’s not necessary. Gabrielatos,
then at the University of Lancaster, published a paper in 2007 outlining the
data collection method used to create a corpus for the RASIM project (more information
here:
http://ucrel.lancs.ac.uk/projects/rasim/). Gabrielatos’ (2007) method
entails collecting a sample corpus using two ‘core’ search terms; generating a
list of possible additional search terms from this sample corpus; testing these
possible additional search terms; and then using those that pass the test (as
well as the core search terms) to collect the full corpus. It’s a far less
time-consuming method, and although there is a slightly increased risk of
collecting articles that aren’t strictly relevant, it is also more systematic,
and therefore more replicable, than just manually selecting articles.
So that’s what I did. I first created
three sample corpora:
1. One at the start of the period
(1/1/00-30/9/00)
2. One at the end of the period (1/1/16-30/9/16)
3. One in the middle of the period
(1/1/08-30/9/08)
I used a handful of core search terms,
rather than two. These included ‘slavery’, ‘forced labour’, ‘sexual
exploitation’, and ‘human trafficking’.
Of these search terms, ‘slavery’ produced
the highest number of articles that did not also mention any of the other
search terms, presumably because historical slavery is often still considered
distinct from modern slavery (which is in itself worthy of examination).
As such, this search term sets the threshold against which all other potential
search terms are tested.
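To make the idea concrete, here is a simplified sketch of one way such a test could be run. The hit counts are invented, and the pass criterion is a loose illustration of the idea rather than a reimplementation of Gabrielatos’ (2007) procedure.

```python
"""A simplified sketch of the threshold test. All figures are invented,
and the pass criterion (a candidate's share of 'exclusive' hits must be
no higher than that of 'slavery') is an illustration only, not a
reimplementation of Gabrielatos' (2007) procedure.
"""

def exclusive_share(total_hits: int, hits_with_core_terms: int) -> float:
    """Share of a term's articles that mention none of the core terms."""
    return (total_hits - hits_with_core_terms) / total_hits

# 'slavery' sets the threshold: among the core terms it had the largest
# share of articles mentioning none of the other core terms.
THRESHOLD = exclusive_share(total_hits=5000, hits_with_core_terms=1500)

def passes(total_hits: int, hits_with_core_terms: int) -> bool:
    """A candidate term passes if it is no 'noisier' than 'slavery'."""
    return exclusive_share(total_hits, hits_with_core_terms) <= THRESHOLD

# Hypothetical candidate term with 800 hits, of which 650 also mention
# a core term: 150/800 = 0.19 <= 0.70, so it would pass.
print(passes(total_hits=800, hits_with_core_terms=650))  # True
```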
I used these three sample corpora to
create key word lists. A key word list shows which words are used much more
often in the sample corpus than in another, reference, corpus. I used rank 60
as the cut-off point for selecting additional search terms from these key word
lists, so only the top 60 words of every list were tested
against the threshold set by ‘slavery’. I also asked my co-investigators to
send me lists of words that they thought could be useful search terms, and
tested those, too.
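For readers curious about what sits behind a key word list: corpus tools typically rank words by a statistic such as log-likelihood. The sketch below is a generic illustration of that calculation, not the particular tool or settings used here.

```python
"""A rough sketch of how a key word list can be computed, using the
log-likelihood statistic that corpus tools commonly report. This is a
generic illustration, not the tool or settings used on this project.
"""
import math
from collections import Counter

def log_likelihood(freq_study: int, freq_ref: int,
                   size_study: int, size_ref: int) -> float:
    """Log-likelihood score for one word across two corpora."""
    expected_study = size_study * (freq_study + freq_ref) / (size_study + size_ref)
    expected_ref = size_ref * (freq_study + freq_ref) / (size_study + size_ref)
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

def key_words(study_tokens: list[str], ref_tokens: list[str], top_n: int = 60):
    """Return the top_n words most over-represented in the study corpus."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    scores = {
        word: log_likelihood(study[word], ref.get(word, 0),
                             len(study_tokens), len(ref_tokens))
        for word in study
        # Keep only words relatively more frequent in the study corpus.
        if study[word] / len(study_tokens) > ref.get(word, 0) / len(ref_tokens)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```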
Eventually, I used the core search terms, together with the additional search
terms that passed the threshold test, to collect the full corpus, which
currently consists of slightly over 80 thousand
articles. Chris’s Python script initially only
recognised articles published by seven of the major British newspapers – which
is what it was originally intended to do. So I adapted it to recognise the
other news sources that we wanted to include in this project.
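The adaptation mostly amounted to extending the list of publication names the script looks for. A sketch of that idea follows, where both the source list and the assumption that the publication name appears near the top of each article are illustrative only.

```python
"""A sketch of mapping an article's metadata to a known news source.
The source list and the assumption that the publication name appears
in the first few lines of each article are illustrative only.
"""

# Titles the script recognises, mapped to folder names; extending this
# project's coverage meant adding more entries (not the full list).
KNOWN_SOURCES = {
    "The Guardian": "guardian",
    "The Times": "times",
    "Daily Mail": "daily_mail",
    "MailOnline": "daily_mail",   # online edition folded into the print title
    "The Sun": "sun",
}

def identify_source(article_text: str) -> str:
    """Return a folder name for the article's publication, or 'unknown'."""
    first_lines = article_text.splitlines()[:5]
    for line in first_lines:
        for name, folder in KNOWN_SOURCES.items():
            if name.lower() in line.lower():
                return folder
    return "unknown"
```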
The next step is to actually conduct
analyses of this corpus, and I will be back to update you on that in December.
- Ilse Ras
References:
Gabrielatos, C. 2007. Selecting query terms to build a specialised corpus from a restricted-access database. ICAME Journal 31, pp. 5-43. Available here: http://clu.uni.no/icame/ij31/ij31-page5-44.pdf