The Gaza Strip is smaller than the microstate of Andorra, and it is almost impossible to get out. Israel controls every exit but one, and all of the crossings are heavily blockaded, creating a humanitarian crisis. A good map of Gaza's possible border crossings was made by the Palestinian Academic Society for the Study of International Affairs (PASSIA) and is here. The one border crossing into Egypt, the Rafah crossing, recently opened up to admit only injured Palestinians into Egypt. This border crossing and the nearby illicit tunnels may be what is driving much of the conflict. Israel wants to shut down the border crossing and nearby tunnels to cut off Hamas's access to supplies, and therefore its ability to govern, whereas Hamas may be hoping to leverage the conflict to pressure Egypt into opening the border crossing further.
Being able to travel freely is one of the greatest privileges afforded to citizens of the first world, clearly illustrated in this map. This privilege is completely absent for the citizens of Gaza, trapped in a war inside a scant 139 square miles. To illustrate this, I took screenshots from MapFrappe of the area of Gaza overlaying different familiar areas of the world. Imagine being stuck somewhere a third the size of New York City - with rockets falling!
Sunday, July 13, 2014
Philippines Language Maps
I love language maps, and looking at how languages interact with space. Much of the history of human movement through space can be reconstructed using linguistics and language maps. For example, linguistic data shows that the Malagasy of Madagascar have their origins in Borneo, Indonesia. Or that peoples as far flung as the Irish, Persians, Spanish, Armenians, Germans and Punjabis all speak related languages, and all have cultural and ancestral roots in one group of people that once lived somewhere near the Black Sea.
However, even with a good understanding of where a language or group of languages is spoken, it can be very difficult to depict on a map. Languages frequently overlap, or exist as linguistic continua that cannot be divided into distinct languages. Language cartographers often follow political boundaries, usually incorrectly. This map, for example, makes it appear that the use of English abruptly ends and Spanish begins at the US-Mexican border. It also makes it look like you are as likely to find French speakers in the far north of Quebec as you are in Montreal.
Another major issue with language maps is that they usually rely on perceptual data rather than on real observations of languages "in the wild". We all know that there is a boundary between Southern American English and Northern American English, but where exactly would you put the line? At the Mason-Dixon Line? Well, no one really has a southern accent in DC or Baltimore, so what about somewhere across Virginia? It's tricky. The best solution, one linguistic geographers have pursued for years, involves large-scale surveys: asking people what they "would" say, recording exactly where they are, and then aggregating the data. This has led to some cool maps, like these, but it is incredibly time- and labor-intensive.
I believe that a new solution to these problems in mapping languages and dialects has recently emerged. In the past few years, geotagged social media have become widely available, offering massive and readily available data sets for mapping everything from linguistic trends to sports fan domains to preferences for church vs beer. Maps made from such large, geotagged linguistic corpora show real occurrences of linguistic phenomena, rather than just perceptual linguistic boundaries. Additionally, because such data come in point form, it is much easier to display overlapping languages and linguistic continua. So, I decided to take a crack at this, and mapped the languages of the Philippines using tweet data:
To collect the tweets, I used the R package twitteR, a wrapper for the Twitter API. I divided the Philippines into 1036 evenly spaced points, and searched for all tweets within a 10 mile radius of each point, covering the whole area of the Philippines. I ran this every night for 5 days until I had one million tweets, of which about 25% were georeferenced.
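My collection script was in R, but the grid idea is easy to sketch in Python. Here is a minimal sketch of generating evenly spaced search points over a bounding box; the bounding-box coordinates and the 14-mile spacing are illustrative assumptions, not the exact values I used.

```python
import math

def grid_points(lat_min, lat_max, lon_min, lon_max, spacing_miles):
    """Generate evenly spaced (lat, lon) points covering a bounding box.

    spacing_miles is the north-south distance between rows; the east-west
    spacing is corrected for latitude so points stay roughly equidistant.
    """
    miles_per_deg_lat = 69.0  # approximate
    dlat = spacing_miles / miles_per_deg_lat
    points = []
    lat = lat_min
    while lat <= lat_max:
        # a degree of longitude shrinks with the cosine of the latitude
        miles_per_deg_lon = 69.0 * math.cos(math.radians(lat))
        dlon = spacing_miles / miles_per_deg_lon
        lon = lon_min
        while lon <= lon_max:
            points.append((round(lat, 4), round(lon, 4)))
            lon += dlon
        lat += dlat
    return points

# Rough bounding box around the Philippines (illustrative values);
# 14-mile spacing lets 10-mile search radii overlap and cover the box
points = grid_points(4.5, 21.0, 116.0, 127.0, spacing_miles=14)
```

Each point would then be fed to a radius search, as with twitteR's geocode parameter ("lat,lon,10mi") in the actual R workflow.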
In order to identify the language of the tweets, I needed corpora. A corpus (plural: corpora) is a large body of text in a given language. This body of text is used to generate data, usually statistical signatures, that can be used to determine whether a given sample text (like a tweet) has the same features as the corpus. So, in order to tell if a tweet is in, say, Hiligaynon, you need a lot of samples of Hiligaynon. To build these corpora, I got samples of literary and religious texts in Tagalog, Bikol, Ilokano, Hiligaynon, Pangasinense, Kapampangan, Cebuano and Waray from SEAlang. However, people speak quite differently in a religious or literary setting than they do while tweeting informally, so to get more "modern" samples of each language, I also built corpora by web scraping Wikipedia using Python's BeautifulSoup package. I made a script that collects the body text from random Wikipedia articles (using the "Random Article" link), and another script that starts at the page for the Philippines and collects text from every page it links to, then from every page those pages link to, and so on. I used both scripts until I had generated a corpus of 300,000 words for every one of the aforementioned languages except Hiligaynon, as well as for Chavacano de Zamboanga, which SEAlang did not have a corpus for. Hiligaynon has a beta wiki with a couple hundred pages, and I used every single one of them to create a considerably smaller corpus.
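The link-following script above is essentially a breadth-first crawl. Here is a sketch of that traversal logic; the `fetch` function is a stub standing in for the real requests + BeautifulSoup call, and the page names in the usage example are made up for illustration.

```python
from collections import deque

def crawl_corpus(start_page, fetch, target_words):
    """Breadth-first crawl: collect body text from start_page, then from
    the pages it links to, and so on, until target_words words are gathered.

    `fetch(page)` must return (body_text, linked_pages); in the real
    script it would wrap requests + BeautifulSoup, but here it is injected
    so the traversal logic can be run on its own.
    """
    corpus, seen = [], {start_page}
    queue = deque([start_page])
    while queue and len(corpus) < target_words:
        page = queue.popleft()
        text, links = fetch(page)
        corpus.extend(text.split())
        for link in links:
            if link not in seen:  # never re-visit a page
                seen.add(link)
                queue.append(link)
    return " ".join(corpus[:target_words])

# Toy "wiki" standing in for live pages (illustrative names and text)
fake_wiki = {
    "Philippines": ("ang mga isang", ["Luzon", "Visayas"]),
    "Luzon": ("dito doon", ["Visayas"]),
    "Visayas": ("oo hindi naman", []),
}
sample = crawl_corpus("Philippines", lambda p: fake_wiki[p], 5)
```

The real script would cap `target_words` at 300,000 per language and write the result to disk.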
Most language identification algorithms work by taking 3- and 4-letter samples of the corpus (called 3-grams and 4-grams, or just n-grams) and determining the frequency distribution of their occurrences. This is done for multiple languages and corpora. The same n-grams are then sampled from the text to be identified, and the distribution of n-grams in the sample text is compared to the distribution of n-grams in the various corpora. Whichever corpus's n-gram distribution most closely matches that of the sample text determines the language of the sample text.
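As a minimal sketch of this method, here is a rank-based n-gram comparison in the style of Cavnar and Trenkle's classic approach; it is not the exact algorithm any particular tool uses, and the corpora in the test are tiny illustrative stand-ins.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked list of a text's most frequent character n-grams."""
    text = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(sample_profile, corpus_profile):
    """Sum, over the sample's n-grams, of how far each one's rank is
    from its rank in the corpus profile (missing n-grams get the
    maximum penalty)."""
    ranks = {g: r for r, g in enumerate(corpus_profile)}
    worst = len(corpus_profile)
    return sum(abs(r - ranks[g]) if g in ranks else worst
               for r, g in enumerate(sample_profile))

def identify(sample, corpora):
    """Return the language whose corpus profile best matches the sample."""
    sp = ngram_profile(sample)
    return min(corpora,
               key=lambda lang: out_of_place(sp, ngram_profile(corpora[lang])))
```

For languages with large, distinctive n-gram inventories this works remarkably well, which is exactly why its failure on Austronesian languages (below) is notable.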
However, this method fails miserably for Austronesian languages, and exclusive word lists must be used instead. This is because Austronesian languages have relatively small phonemic inventories (Hawaiian has only eight consonants!) and almost all have a simple consonant-vowel syllable structure. Thus there is not enough variability in possible n-grams, and languages cannot be classified based on n-gram distributions. Google Translate uses the n-gram method, and when I use it for Tagalog, it frequently thinks that the text I am entering is Indonesian or Cebuano. A more relevant example of this appears in other attempts to map world languages. Two high-res language maps of Twitter exist here and here. Outside of the Philippines, they are fantastic (just look at Europe), but I am assuming that they use the n-gram method, because they both mis-identify Tagalog (and other Filipino languages) as Indonesian:
Comparison to Other Maps
English vs Tagalog
Clearly, English is widely used throughout the Philippines, as is Tagalog. However, it appears that Tagalog and Taglish (code-switching between English and Tagalog in one tweet) are more common in the north, the homeland of the Tagalog language. I was told by many Filipinos that southerners, who speak mostly Cebuano, resent the fact that Tagalog became the national language, since Cebuano was spoken by more people over a wider area. For this reason, Cebuanos use less Tagalog and more English. This map seems to confirm that, and taking the median point for Tagalog, English and Taglish shows a trend of Tagalog being more common in the north and English being more common in the south. Nevertheless, English was much more common than Tagalog overall, with 80,000 English tweets, 28,000 Tagalog tweets and 16,000 tweets with at least 3 words in Tagalog and 3 words in English.
Here is a map of the minority language tweets. As I discuss later, there were issues with language identification for these minority tweets, so their relative totals are probably not indicative of how prevalent they actually are on Twitter, much less in everyday spoken usage. Nevertheless, I think this map does a great job of displaying their geographic distributions. Hiligaynon (also called Ilonggo) is accurately shown as occurring on both Panay Island and Negros in the central islands (called the Visayas), as well as in the east of the southern island of Mindanao. Cebuano was also accurately identified across its full range, despite the fact that it has many different varieties, indicating that the corpus drew on a good variety of texts. This was not true for Bikol, which also has many varieties. Because the corpus from Wikipedia was based on Central Bikol, also called Naga Bikol, both of the Bikol tweets identified were found near Naga, and not in the larger city of Albay, which speaks a slightly different variety.
Many of the minority languages were very under-represented on Twitter, with only one tweet coming in from Waray, two from Bikol, and nine from Kapampangan. In addition, no tweets were found in Pangasinense, spoken between Kapampangan and Ilokano, or in Chavacano de Zamboanga, a creole based on Spanish and local languages, spoken in Mindanao.
Issues with Language Identification
One major issue with this study was that the corpora and the tweets were written in quite different registers. Filipinos use different words when they are writing tweets than when they are writing the Bible. And even when they are using the same words, they spell them differently. This is especially true for minority languages, which are almost entirely spoken and rarely used in a formal literary setting. English, on the other hand, is the main language used in school, business and other settings that involve a lot of writing. Thus, Filipinos are much more likely to use "proper" English than "proper" Tagalog when writing tweets, and they would almost never use the "proper" spelling and vocabulary of their minority languages. However, my corpora were based on the "proper" versions of these languages, so minority languages are quite underrepresented, and my study does not represent an accurate measure of, say, Waray usage versus English usage. Only about 125,000 of the 250,000 geotagged tweets could be classified, and I believe that most of this missing half represents tweets written in "improper" Tagalog and minority languages (along with a few tweets in completely different languages, like Chinese or German).
For example, "I have not yet been to that new SM" would be, in proper Tagalog, "Hindi pa ako nakapunta sa yung bagong SM". However, a Filipino using social media would likely write something like "Di p aq nkpunta sa yn bago SM". This manner of writing has been taken to an almost incomprehensible level in the cultural phenomenon of jejemon (similar to leetspeak in English) and cannot be identified as Tagalog from a corpus in "proper" Tagalog. Slang is also quite common in spoken Tagalog and in spoken minority languages, and is impossible to identify from a "proper" corpus. In fact, there are whole dialects of Tagalog based on slang, such as this code language used by the gay community. One Bikol word I learned while living in the Philippines was "uragon", which means strong or manly. However, this word was not in the Bikol corpus I generated: it appears nowhere in the Bible, in literary texts or on the Bikol Wikipedia.
Finally, the end result of the language classification required a great deal of cleaning and spot checking. This was because many of the minority languages' corpora contained words that are also English slang, so many English tweets were misidentified as minority language tweets. For example, "haha" is a word in Waray (or at least, a word used in the Waray Bible, Waray literary works, or the Waray Wikipedia), and "omg" is a word in Kapampangan. Since I decided that three matching words place a tweet in a given language category, the following tweets were classified as Kapampangan and Waray:
"OMG ILOVEYOU OMG OMG OMG OMG ILOVEYOU DJ HUHUHU ILOVEYOU WAHHH"
"ge tawa tayo :( haha haha haha""
I had to do a lot of spot checking and re-running the classification before I ended up with results that I was satisfied with.
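The three-matching-words rule can be sketched as follows. The word lists here are tiny illustrative stand-ins for the exclusive word lists derived from my corpora, chosen to show how repeated interjections cause exactly the misclassifications above.

```python
def classify(tweet, word_lists, threshold=3):
    """Assign a tweet to the first language with at least `threshold`
    token matches against that language's exclusive word list.

    Matches are counted with multiplicity, which is exactly how a
    repeated interjection like "haha haha haha" can tip an English
    tweet into the wrong language.
    """
    tokens = tweet.lower().split()
    for lang, words in word_lists.items():
        if sum(tok in words for tok in tokens) >= threshold:
            return lang
    return None

# Tiny illustrative word lists (not the real corpora-derived ones)
lists = {
    "kapampangan": {"omg", "ing", "ning"},
    "waray": {"haha", "hin", "nga"},
}
classify("OMG ILOVEYOU OMG OMG OMG", lists)      # → "kapampangan"
classify("ge tawa tayo haha haha haha", lists)   # → "waray"
```

Counting unique matches instead of total matches, or weighting words by how exclusive they really are, would suppress most of these false positives.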
FURTHER PROJECTS
I think there are a number of ways I could do this project better, if I ever wanted to use it for an academic conference or paper. A larger Twitter data set could help capture more tweets from the more obscure minority languages, but better language classification techniques could make the biggest improvements. I think the best way would be to build corpora from tweets themselves. To get this, I'd have to find native speakers of each minority language, give them a data set of a couple thousand tweets, and have them identify the ones that are in the minority languages they know. These training tweets would then be used to classify, say, 1 million tweets. Another possibility would be to run an "unsupervised" classification. With this method, there is no training data set or corpus; rather, all of the tweets would be sorted into natural groups based on certain statistical features. I have done this before with pixels in remotely sensed satellite imagery, but I am not exactly sure how to go about it for text data, or whether it has ever been done before.
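As a rough sketch of what the unsupervised route might look like, here is a bare-bones k-means clustering of tweets by letter-frequency features. This is purely illustrative: real work would need richer features (n-gram vectors, word co-occurrence) and a principled choice of k.

```python
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def features(text):
    """Normalized letter-frequency vector for one tweet."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) or 1
    return [counts[ch] / total for ch in ALPHABET]

def nearest(vec, centers):
    """Index of the center closest to vec (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((v - m) ** 2
                                 for v, m in zip(vec, centers[c])))

def kmeans(vectors, k, iters=20, seed=0):
    """Bare-bones k-means; returns one cluster label per vector."""
    random.seed(seed)
    centers = random.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [nearest(vec, centers) for vec in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # recompute center as the mean of its members
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels
```

The hope would be that tweets in the same language land in the same cluster, which a native speaker could then label after the fact.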
I would also like to use this method that I have just developed in other places outside of the Philippines. I think Indonesia and Malaysia would be good candidates, as they are other areas with high twitter penetration, many minority languages, and Austronesian languages that cannot be classified with the n-gram method. Another possibility would be to map the languages of tweets in various international cities with a lot of linguistic diversity, like Singapore, London, New York and Hong Kong.
Monday, June 23, 2014
Most and least studied places on earth.
Since I do research in Mali, I've spent a lot of time on sites like Google Scholar, Scopus and Web of Science looking for articles about similar research done there. But there aren't a lot of articles on this obscure African country. I wondered how research on Mali compares to the amount done in other African countries, or even other countries in the world. So I decided to make a map (or 3).
And I was right, Mali is one of the more under-studied places on earth, but so are most African countries. Both per capita and per square kilometer, the DRC is the most understudied, which is a real shame. The DRC is home to incredible natural wealth and biodiversity that definitely deserves more scholarly attention, as well as the deadliest conflict since World War II, which needs more attention from the world in general. I was surprised to see so many results about China, but I guess there is a lot to study there.
These results just show the raw number of articles returned, and don't show the subjects the research was done on, whether it appeared in books or articles, or even when it was published. All of this would be super interesting to dig into, if only Google made automated queries and web-scraping easier.
Another major caveat is that some country names will only come up in English-language queries, for example "Germany" and "Democratic Republic of the Congo" will only return English results, whereas a country name like "Mali" will return results in most languages that academic articles are written in.
For countries whose names mean more than just the country, I tried to exclude other related terms. So, for Guinea, I had to exclude Guinea Worm, Equatorial Guinea, Guinea Pig, Guinea Fowl, etc.
I could have simply searched for every country on Google Scholar manually. But why do something repetitive and boring on a computer when that's exactly the kind of task that computers are good at doing? You just have to know how to tell a computer exactly what you want it to do. So I wrote a script in Python using the BeautifulSoup module to query Google Scholar for each country and return the number of articles about it. But... Google doesn't permit queries that aren't from browsers, or queries that are automated. So I had to disguise Python as a browser, and I could only get about 40 results before Google Scholar stopped returning them. All in all, it took me a few days, whereas just doing it manually would have taken maybe an hour. But, hey, it was fun!
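The browser-disguise trick amounts to sending a browser-like User-Agent header. Here is a sketch: the header string, the URL pattern, and the "About N results" phrasing are assumptions about how the page looks, not a stable API, and the result pages change without notice.

```python
import re
import urllib.parse
import urllib.request

# Pretend to be a normal browser; the default Python user agent is refused
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_scholar_page(query):
    """Download one Google Scholar results page for `query`.
    The URL pattern here is an assumption and may be blocked or changed."""
    url = "https://scholar.google.com/scholar?q=" + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def result_count(html):
    """Pull the hit count out of a results page. The 'About N results'
    phrasing is an assumption about the page text, not a guarantee."""
    match = re.search(r"About ([\d,]+) results", html)
    return int(match.group(1).replace(",", "")) if match else None
```

In practice, a polite delay between requests is essential, and even then the results dry up quickly, as I found out.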
Friday, June 20, 2014
Cellular Automata inspired background
Some people in geography have tried to use the mathematical concept of cellular automata (CA) to understand changes in land cover and land use. I tried to capture this idea in an Escheresque image, which I have set as the background for this blog. It shows a gradual transition from an elementary cellular automaton into a satellite image of agriculture and finally a city on a river. I generated the CA using NetLogo, and found a Landsat image of Minnesota online. It was a fun foray into Photoshop.
Cellular automata are really interesting. The idea is that each cell can be black or white, and a cell's color changes depending upon the colors of its neighbors. Every single cell follows the same exact rules, yet large-scale patterns seem to emerge. One version of CA, known as Conway's Game of Life, produces incredibly complex emergent figures, as seen in this epic YouTube video.
The most fascinating thing about CA is that incredibly complex patterns can emerge from very simple rules. So it's only natural that geographers trying to understand the complex patterns of land use change would turn to the relatively simple rules of CA.
In fact, the specific CA that I chose for my background, known as Rule 110 among the elementary cellular automata, is possibly the simplest system in the world that is Turing-complete. This means that this particular CA can run computations; in fact, given sufficient cells, it could run Windows XP! With such astounding properties, CA-like rules could explain the emergence of things like landscapes and cities, or even reality.
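An elementary CA like Rule 110 fits in a few lines of code: the rule number's binary digits give the new state for each of the eight possible 3-cell neighborhoods. A minimal sketch, treating cells off the edge as 0:

```python
def step(cells, rule=110):
    """One update of an elementary cellular automaton.

    The new state of each cell is the bit of `rule` indexed by the
    3-bit number formed by (left neighbor, self, right neighbor);
    cells off the edge are treated as 0.
    """
    padded = [0] + cells + [0]
    return [(rule >> (padded[i - 1] * 4 + padded[i] * 2 + padded[i + 1])) & 1
            for i in range(1, len(padded) - 1)]

# Watch the pattern grow from a single black cell
row = [0, 0, 0, 0, 1]
for _ in range(4):
    print("".join("#" if c else "." for c in row))
    row = step(row)
```

Swapping in `rule=30` or `rule=90` shows just how differently these one-byte rules can behave.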
Monday, January 20, 2014
Oh yea, and here's one from a grant I'm working on. I want to look at why Yanfolila has lower rates of anemia. Could it be because they don't grow cotton, and (seem to) have healthier, thicker forests? Give me some money, and I'll go find out! I served in Peace Corps in the extremely anemic region of Kolondieba.
Monday, December 30, 2013
One of my favorite places to hang out on the internet when I find free time is reddit's map enthusiast community: www.reddit.com/r/MapPorn. Users can submit maps, comment on them, and vote them up or down. I've taken the time to make two maps that I've posted there, so I figured I'd re-post them here.