doc updates

Joshua Widmann 2015-02-19 20:48:03 +01:00
commit 07d6a9cbaa

@@ -30,7 +30,7 @@ This basically gave us the date of the incident, the district where the incident
## More precise locations
To visualize the incidents on a map, the district alone seemed too imprecise a basis. We therefore decided to analyze the description text of each incident to acquire further location information, since most descriptions contain some. To identify possible locations within continuous text, we used the part-of-speech tagger from the [Natural Language Toolkit](http://www.nltk.org/) (NLTK), which assigns a part of speech, such as noun, verb or adjective, to each word ([analyze.py](analyze.py)). With these tags assigned, we can go through the text and extract those nouns and names that appear after a preposition (the tags *APPR* and *APPRART*). We treated these as the most likely sources of further information on the incident location, such as train stations and street names.
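
The extraction step can be sketched roughly as follows. This is a minimal illustration, not the code from analyze.py: it assumes the text has already been tagged into `(word, tag)` pairs, that nouns and names carry the STTS tags *NN* and *NE*, and that articles and adjectives (*ART*, *ADJA*) between the preposition and the noun may be skipped. The function name and the sample sentence are hypothetical.

```python
# Tags for prepositions, as named in the text above.
PREPOSITION_TAGS = {"APPR", "APPRART"}
# Assumption: common nouns and proper names use the STTS tags NN and NE.
NOUN_TAGS = {"NN", "NE"}
# Assumption: articles and attributive adjectives may sit between
# the preposition and the noun ("in der Sonnenallee").
SKIPPABLE_TAGS = {"ART", "ADJA"}


def extract_location_candidates(tagged_tokens):
    """Collect nouns and names that follow a preposition."""
    candidates = []
    after_preposition = False
    for word, tag in tagged_tokens:
        if tag in PREPOSITION_TAGS:
            after_preposition = True
        elif after_preposition and tag in NOUN_TAGS:
            candidates.append(word)
        elif after_preposition and tag in SKIPPABLE_TAGS:
            continue
        else:
            # Any other word ends the prepositional phrase.
            after_preposition = False
    return candidates


# Hypothetical, hand-tagged sample sentence.
tagged = [
    ("Am", "APPRART"), ("Abend", "NN"), ("wurde", "VAFIN"),
    ("eine", "ART"), ("Frau", "NN"), ("in", "APPR"),
    ("der", "ART"), ("Sonnenallee", "NE"), ("beleidigt", "VVPP"),
]
print(extract_location_candidates(tagged))  # ['Abend', 'Sonnenallee']
```

Note that *Abend* (evening) is picked up alongside the street name, because it also follows a preposition.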
While doing this, we realized that some of the words we extracted from the text were completely irrelevant (*hair*, *face*, *woman*, *evening* and the like). We assumed we could identify these by checking whether a German dictionary contains the word, which tells us whether it is a relevant location or just an ordinary German noun. We ended up checking against a text file containing around 24,000 German nouns.
While doing this, we realized that some of the words we extracted from the text were completely irrelevant (*hair*, *face*, *woman*, *evening* and the like). We assumed we could identify these by checking whether a German dictionary contains the word, which tells us whether it is a relevant location or just an ordinary German noun. We ended up checking against a text file containing 24,715 German nouns.
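
A dictionary check of this kind might look like the sketch below. It assumes the noun list is a UTF-8 text file with one noun per line and that comparison should ignore case; the file layout and both function names are assumptions for illustration, not taken from analyze.py.

```python
def load_nouns(path):
    """Read a noun list (assumed: one noun per line) into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_locations(candidates, common_nouns):
    """Keep only candidates that are NOT ordinary dictionary nouns."""
    return [word for word in candidates if word.lower() not in common_nouns]
```

With a noun set containing *abend* and *frau*, `filter_locations(["Abend", "Sonnenallee"], nouns)` keeps only `"Sonnenallee"`. Using a set keeps each lookup cheap even with tens of thousands of nouns.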
## Categorizing incidents
As we examined some of the incident descriptions, we concluded that most incidents can be grouped into certain categories. We recognized four major categories: *homophobic*, *antisemitic*, *sexist* and *racist* incidents. To automatically assign categories to each incident, we implemented a simple algorithm that searches the incident description for certain keywords (see *bad_words* below). This way an incident can be tagged with none, one or several categories ([analyze.py](analyze.py)). We stored these assignments in a separate table of our SQLite database called *category*, with the columns *ID*, *Name* and *Article_ID*.
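
The keyword search and the *category* table could be sketched as follows. The keyword stems in `BAD_WORDS` are placeholders chosen for illustration, not the project's actual *bad_words* list, and the column types, function names and case-insensitive substring matching are likewise assumptions.

```python
import sqlite3

# Placeholder keyword stems; the real bad_words list is not reproduced here.
BAD_WORDS = {
    "homophobic": ["homophob"],
    "antisemitic": ["antisemit"],
    "sexist": ["sexist"],
    "racist": ["rassist"],
}


def categorize(description, bad_words=BAD_WORDS):
    """Return every category whose keywords occur in the description."""
    text = description.lower()
    return [
        name
        for name, keywords in bad_words.items()
        if any(keyword in text for keyword in keywords)
    ]


def store_categories(connection, article_id, categories):
    """Insert one row per assigned category into the category table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS category ("
        "ID INTEGER PRIMARY KEY, Name TEXT, Article_ID INTEGER)"
    )
    connection.executemany(
        "INSERT INTO category (Name, Article_ID) VALUES (?, ?)",
        [(name, article_id) for name in categories],
    )
    connection.commit()


# Usage with an in-memory database and a hypothetical article ID.
connection = sqlite3.connect(":memory:")
found = categorize("Ein Mann wurde rassistisch und homophob beleidigt.")
store_categories(connection, 7, found)
print(found)  # ['homophobic', 'racist']
```

Storing one row per category lets a single incident carry several categories, and an incident without any match simply has no rows.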