Wednesday, October 1, 2014

Back in Mali!

Hello all! I am back in Mali to do research for my Master's thesis. It has been so wonderful to be back here in Mali with all of its craziness and joyfulness. Part of my work was in my old Peace Corps village of Kissa, and it was kind of a dream to be back, chatting with all my old friends, and hiking through all the old forests and fields I used to explore.  I have been resolute about taking lots of pictures this time, as I did not take nearly enough during my Peace Corps stay. So here are some of the more interesting and illustrative pictures.

 This is Toh, pretty much the national dish. You take a handful of the goop, made from millet or corn flour and kind of textured like polenta, and then you dip it into one of the bowls of sauce. The sauce is made from combinations of okra, peanut butter, tomatoes and hot peppers. With a good sauce, Toh can be delicious, and I thoroughly missed it while I was away.

This is how Malians do tea: with two shot-sized glasses, two little tea pots, and a lot of sugar! The tea is boiled down into a thick, syrupy shot, and one round of tea can provide about 3 to 6 people with a sip. Usually there are 2 or 3 rounds, taking place over the course of a conversation-filled hour. There is lots of pouring the tea back and forth, to thoroughly mix it in and to cool it off.

 Here is a hunter playing a hunters-guitar (donsongoni). There was a big celebration in Kolondieba for Malian independence day, and lots of hunters came in traditional garb with guns and musical instruments. I was invited to sit up with the Mayor, and the hunter was going around and singing to each person so that they would give him some money. He came straight to me, figuring the American would have the most money. I snapped some pictures and gave him some change. You put it directly in the guitar, actually, and the hunter rattles it around and makes it a part of the instrument.

 Here's an interesting picture: a satellite dish, surrounded by Mango trees, mud huts and thatch roofs. I was out for a walk and I came across this bugu-da, a household out in the middle of nowhere. Often people will choose to live out here because of the virgin soil and empty space, which helps to grow more productive crops and raise more cattle. I've noticed that these people tend to be more wealthy, as you can see that this particular farmer, Lassina Kone, was able to buy a satellite dish and a color TV! Lassina was very friendly and curious about America, and offered me some yams to take with me.

 This is one of my favorite pictures of one of my best friends, Adama. He is both very curious and very informed about the world, and loves when I get National Geographic magazines sent from home. Here, he is using an inflatable globe that I brought to explain to some people how it is the earth that moves, and not the sun. This is actually a pretty controversial subject in my village, but luckily Adama and his new globe should help settle the debate.

My Research...

I got a grant from the West African Research Association to look at malnutrition in Mali, and how it is correlated with other factors like cotton production and environmental degradation. I used a map that I published before on this blog to apply for the grant, and I think it explains a lot of the context of my research. I will be doing work in three different villages, and for each village I will be doing household surveys, as well as forest and land cover surveys.  The idea is to see if healthy forests and certain livelihood strategies (like growing cotton) have any significant relationship with patterns of malnutrition. Here are some pictures from the research.

Collecting a list of all of the household heads' names from the village secretary, Lassina (black hat). I randomly picked from the list to determine which households to interview. Also pictured are my good friend Oumar, who worked with me during my Peace Corps service, as well as my host father Amadou, in the blue shirt. Lassina and Oumar were both incredibly helpful when I showed up and explained what I had to do.

Here are some shots of me conducting interviews. They were taken by Oumar, who really enjoys using the camera!

 This is me measuring Mid-Upper Arm Circumference, a good indicator of a child's overall health. For all the children in each study household, I have to measure their arms. Often they are terrified, having never seen a white person before. To make the experience less traumatic I give them candy.  Also, notice in this picture, someone in the background wearing a shirt that says something in English. There is a 0% chance she knows what that shirt says.

Finally, I give the kids candy. Actually, these are those vitamin-fortified candies you can get in America. If the kids have really skinny arms, I give their moms a couple, and tell them to feed the child one a day.

So that's my research so far! I can't wait to start the forest surveys.

Sunday, August 10, 2014

Ebola Prevalence

I am leaving in three weeks to do my research for my Master's thesis in Mali. I can't wait. However, something that has been on my mind lately is the Ebola outbreak in nearby Guinea, Sierra Leone and Liberia. The media sure is talking a lot about it, and my family and friends are quite worried about this epidemic in West Africa.

But how much of an epidemic is it really? Well, there are about 21.6 million people in Guinea, Sierra Leone and Liberia, the three countries most affected by the outbreak. In those countries, there have been 959 Ebola deaths as of August 8th, according to the World Health Organization. That means that 0.0044% of the population died from Ebola since the start of the outbreak, in March 2014.

To compare that to a US statistic, we had 32,482 fatal car crashes in 2011. Given our population of 316 million, the average American had about a 0.0043% chance of dying in a car accident in any five-month period of 2011.

That means an American was just about as likely to die in a car crash in a 5 month period as a West African from Guinea, Sierra Leone or Liberia was to die from Ebola in the 5 months since the outbreak began.
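The arithmetic behind this comparison can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted above and assumes the annual US figure can simply be scaled to a five-month window; the variable names are made up for illustration.

```python
# Figures quoted in the post
ebola_deaths = 959              # WHO count as of August 8, 2014
west_africa_pop = 21.6e6        # Guinea + Sierra Leone + Liberia
us_crash_deaths_2011 = 32_482   # annual US figure
us_pop = 316e6
months = 5                      # length of the outbreak so far

# Share of the three-country population that died from Ebola
ebola_rate = ebola_deaths / west_africa_pop

# Scale the annual US figure down to a five-month window (an assumption:
# crashes are treated as evenly spread over the year)
crash_rate = us_crash_deaths_2011 * (months / 12) / us_pop

print(f"Ebola deaths, 5 months:     {ebola_rate:.4%}")
print(f"US crash deaths, 5 months:  {crash_rate:.4%}")
```

Both percentages come out within a fraction of a thousandth of a percentage point of each other, which is the point of the comparison.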

Ebola is dangerous, and something the world should deal with quickly and decisively. But it is not rampant, and I am not worried about it affecting me during my fieldwork in Mali.

Monday, July 21, 2014

The Gaza Strip is smaller than the micronation of Andorra, and it's almost impossible to get out.  Israel controls every exit but one, and they are heavily blockaded, creating a humanitarian crisis.  A good map of Gaza's possible border crossings was made by The Palestinian Academic Society for the Study of International Affairs (PASSIA) and is here. The one border crossing into Egypt, the Rafah crossing, recently opened up to admit only injured Palestinians into Egypt. This border crossing and the nearby illicit tunnels may be what is driving much of the conflict. Israel wants to shut down the border crossing and nearby tunnels to cut off Hamas's access to supplies and therefore their ability to govern, whereas Hamas may be hoping to leverage the conflict to pressure Egypt to make the border crossing more open.

Being able to travel freely is one of the greatest privileges afforded to citizens of the first world, clearly illustrated in this map. This privilege is completely absent for the citizens of Gaza, trapped in a war and in a scant 139 square miles. To illustrate this, I took screenshots from MapFrappe of the area of Gaza overlaying different familiar areas of the world. Imagine being stuck somewhere a third the size of New York City - with rockets falling!

Sunday, July 13, 2014

Philippines Language Maps



I love language maps, and looking at how languages interact with space.  Much of the history of human movement through space can be re-constructed using linguistics and language maps.  For example, linguistic data shows that the Malagasy of Madagascar have their origins in Borneo, Indonesia. Or that people as far flung as the Irish, Persians, Spanish, Armenians, Germans and Punjabis all speak related languages, and they all have cultural and ancestral roots in one group of people that once lived somewhere near the Black Sea.

However, even with a good understanding of where a language or group of languages is spoken, it can be very difficult to depict on a map. Languages frequently overlap, or exist as linguistic continua that cannot be categorized into distinct languages. Language cartographers will often follow political boundaries, usually incorrectly.  This map, for example, makes it appear that the use of English abruptly ends and Spanish begins at the US-Mexican border.  It also looks like you are as likely to find French speakers in the far north of Quebec as you are in Montreal.

Another major issue with language maps is that they usually rely on perceptual data, but not on real observations of languages "in the wild".  We all know that there is a boundary between Southern American English and Northern American English, but where exactly would you put the line? At the Mason-Dixon line? Well, no one really has a southern accent in DC or Baltimore, so what about somewhere across Virginia? It's tricky.  The best solution, which linguistic geographers have been using for years, involves large-scale surveys, asking people what they "would" say, recording where exactly they are, and then aggregating this data.  This has led to some cool maps, like these ones, but is incredibly time- and labor-intensive.

I believe that recently a new solution has emerged to these problems in mapping languages and dialects.  In the past few years, geotagged social media have become widely available, offering massive and readily available data sets for mapping everything from linguistic trends to sports fan domains to preferences for church vs beer.  Maps made from such large, geotagged, linguistic corpora show real occurrences of linguistic phenomena, rather than just perceptual linguistic boundaries.  Additionally, because such data is available in point form, it makes it much easier to display overlapping languages and linguistic continua.  So, I decided to take a crack at this, and mapped the languages of the Philippines using tweet data:



Collecting Tweets

To collect the tweets, I used the R package twitteR, a wrapper for the Twitter API.  I divided the Philippines into 1036 evenly-spaced points, and searched for all tweets within a 10 mile radius of each point, covering the whole area of the Philippines. I ran this every night for 5 days until I had one million tweets, of which about 25% were georeferenced.
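The search grid can be sketched as follows. This is not the original R code: it is a rough Python illustration that lays a grid over a hypothetical bounding box for the Philippines and formats each point as a "latitude,longitude,radius" string, the form that Twitter's search API accepted for its geocode parameter at the time. The bounding box, the 14-mile spacing and the function names are all assumptions, and a plain box grid also includes open ocean, so it yields more points than the 1036 used for the map.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate length of one degree of latitude


def grid_points(south, west, north, east, spacing_miles):
    """Return (lat, lon) points spaced roughly evenly over a bounding box."""
    points = []
    lat_step = spacing_miles / MILES_PER_DEGREE_LAT
    lat = south
    while lat <= north:
        # A degree of longitude shrinks with the cosine of the latitude
        lon_step = spacing_miles / (
            MILES_PER_DEGREE_LAT * math.cos(math.radians(lat))
        )
        lon = west
        while lon <= east:
            points.append((round(lat, 4), round(lon, 4)))
            lon += lon_step
        lat += lat_step
    return points


def geocode_param(lat, lon, radius_miles=10):
    """Format one search point as 'lat,lon,radius' with a radius in miles."""
    return f"{lat},{lon},{radius_miles}mi"


# Hypothetical bounding box (south, west, north, east) around the Philippines
PH_BBOX = (4.5, 116.0, 21.5, 127.0)

# Circles of radius 10 miles on a square grid cover the whole plane only if
# the spacing is at most 10 * sqrt(2), about 14.1 miles
search_points = grid_points(*PH_BBOX, spacing_miles=14)
```

Each string from `geocode_param` would then be passed to whatever search call is being used, one request per point.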


Collecting Corpora

In order to identify the language of the tweets, I needed corpora.  A linguistic corpus (singular of corpora) is a large body of text in a given language.  This large body of text is used to generate data, usually statistical signatures, that can be used to determine if a given sample text (like a tweet) has the same features as the corpus.  So, in order to tell if a tweet is in, say, Hiligaynon, you need a lot of samples of Hiligaynon.  To build these corpora, I got samples of literary and religious texts in Tagalog, Bikol, Ilokano, Hiligaynon, Pangasinense, Kapampangan, Cebuano and Waray from SEAlang. However, people speak quite differently in a religious or literary setting than they do while tweeting informally, so to get more "modern" samples of each language, I also built corpora by web scraping from Wikipedia using Python's BeautifulSoup package.  I made a script that collects the body text from random Wikipedia articles (using the "Random Article" link), and I also made a script that starts at the page for the Philippines and collects text from every page that it links to, and then every page that those ones link to, etc.  I used both scripts until I generated a corpus of 300,000 words for every one of the aforementioned languages except Hiligaynon, as well as for Chavacano de Zamboanga, which SEAlang did not have a corpus for. Hiligaynon has a beta-wiki with a couple hundred pages, and I used every single one of them to create a considerably smaller corpus.
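The link-following script can be sketched roughly as below. This is not the original code: to keep the example free of dependencies it swaps BeautifulSoup for the standard library's `html.parser`, and it takes the page-downloading function as an argument (`fetch`, a hypothetical name) so that the crawl logic can be read and tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect paragraph text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False
            self.text_parts.append(" ")  # keep paragraphs from running together

    def handle_data(self, data):
        if self.in_paragraph:  # body text only; ignore menus, footers, etc.
            self.text_parts.append(data)


def crawl_corpus(start, fetch, target_words):
    """Breadth-first crawl from `start` until `target_words` words are collected.

    `fetch` is any function that maps a page address to its HTML string.
    """
    queue, seen, words = deque([start]), {start}, []
    while queue and len(words) < target_words:
        parser = PageParser()
        parser.feed(fetch(queue.popleft()))
        words.extend("".join(parser.text_parts).split())
        for link in parser.links:
            if link not in seen:  # never visit the same page twice
                seen.add(link)
                queue.append(link)
    return words
```

In real use, `fetch` would download the page, relative links would need to be resolved to full addresses, and links pointing outside the article namespace would need to be filtered out.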


Identifying Languages

Most language identification algorithms work by taking 3 and 4 letter samples of the corpus (called 3-grams and 4-grams, or just n-grams) and determining the frequency distribution of their occurrences.  This is done for multiple languages and corpora.  These n-grams are also sampled from the text to be identified, and the distribution of n-grams from the sample text is compared to the distribution of n-grams in the various corpora. Whichever corpus's distribution of n-grams most closely matches that of the sample text is determined to be the language of the sample text.
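A bare-bones version of the n-gram approach looks something like this. It is a sketch for illustration only: real identifiers differ in how they compare distributions, and the use of cosine similarity here, like the function names, is an assumption rather than a description of any particular tool.

```python
import math
from collections import Counter


def ngram_profile(text, sizes=(3, 4)):
    """Relative frequencies of character 3-grams and 4-grams in `text`."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(
        text[i:i + n] for n in sizes for i in range(len(text) - n + 1)
    )
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(sample, corpora):
    """Name of the corpus whose n-gram profile is closest to the sample's."""
    sample_profile = ngram_profile(sample)
    return max(
        corpora,
        key=lambda name: cosine_similarity(
            sample_profile, ngram_profile(corpora[name])
        ),
    )
```

Here `corpora` is a dictionary mapping a language name to a large string of text in that language; with real corpora the profiles would be computed once and cached rather than rebuilt for every sample.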

However, this method fails miserably for Austronesian languages, and exclusive word lists must be used.  This is because Austronesian languages have relatively small phonemic inventories (Hawaiian only has eight consonants!) and almost all have a simple, Consonant-Vowel syllable structure.  Thus there is not enough variability in possible n-grams, and languages cannot be classified based on n-gram distributions.  Google Translate uses the n-gram method, and when I use it for Tagalog, it frequently thinks that the text I am entering is in Indonesian or Cebuano.  A more relevant example of this is in other attempts to map world languages.  Two high-res language maps of twitter exist here and here. Outside of the Philippines, they are fantastic (just look at Europe), but I am assuming that they use the n-gram method, because they both mis-identify Tagalog (and other Filipino languages) as Indonesian:




Comparison to Other Maps

So here is my final map placed next to a map of the languages of the Philippines from Wikipedia.  The map from Wikipedia does a good job of quickly showing where one might find minority languages, but it does a bad job of showing location precisely, specifically with regard to density.  It also does a pretty bad job of showing where languages overlap.  It is clear in the tweet map, for example, that while the minority languages are confined to certain regions, English and Tagalog are prevalent throughout the country.  This is unclear in Wikipedia's map, which makes it appear as if Tagalog is just one of many minority languages, when in fact it is far more prevalent.  The tweet map also does a much better job of displaying language density.  Ilocano, for example, is most common on the northeast coast of Luzon, is much less common in north-central Luzon (because of the many small languages spoken in the mountains) and is also uncommon on the northwest coast of Luzon (because the whole area is sparsely populated).  This is clear in the tweet map, where most of the Ilocano tweets are on the northeast coast and only scattered tweets appear in other regions, whereas the Wikipedia map makes the Ilocano area appear uniform.  The Wikipedia map does add the diamond indicating that a language is only a plurality - but it does not indicate which other languages are present or where a given language is concentrated, like the dot map does.


English vs Tagalog

Clearly, English is widely used throughout the Philippines, as is Tagalog.  However, it appears that Tagalog and Taglish (code-switching between English and Tagalog in one tweet) are more common in the North, the homeland of the Tagalog language.  I was told by many Filipinos that southerners, who speak mostly Cebuano, resent the fact that Tagalog became the national language, since Cebuano was spoken by more people over a wider area.  For this reason, Cebuanos use less Tagalog and more English.  This map seems to confirm that, and taking the median point for Tagalog, English and Taglish shows a trend of Tagalog being more common in the north and English being more common in the south.  Nevertheless, English was much more common than Tagalog, with 80,000 English tweets, 28,000 Tagalog tweets and 16,000 tweets with at least 3 words in Tagalog and 3 words in English.
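One simple reading of "median point" is the coordinate-wise median of the tweet locations: the median latitude paired with the median longitude. The sketch below assumes that definition (other definitions of a median center exist), and the function name is made up for illustration.

```python
from statistics import median


def median_point(points):
    """Coordinate-wise median of a list of (lat, lon) pairs."""
    lats, lons = zip(*points)
    return median(lats), median(lons)
```

Computing this separately for the Tagalog, English and Taglish tweets gives one point per group, and comparing their latitudes shows which group sits further north.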


Minority Languages 

Here is a map of the minority language tweets.  As I discuss later, there were issues with language identification for these minority tweets, so their relative totals are probably not indicative of how prevalent they actually are on twitter, much less in everyday, spoken usage.  Nevertheless, I think this map does a great job of displaying their geographic distributions.  Hiligaynon (also called Ilonggo) is accurately shown as occurring on both Panay Island and Negros in the central islands (called the Visayas), as well as in the east of the southern island of Mindanao.  Cebuano was also accurately identified across its full range, despite the fact that it has many different varieties, indicating that the corpus drew on a good variety of texts.  This was not true for Bikol, which also has many varieties.  Because the corpus from Wikipedia was based on Central Bikol, also called Naga Bikol, both of the Bikol tweets identified were found near Naga, and not in the larger city of Albay, which speaks a slightly different variety.

Many of the minority languages were very under-represented on twitter, with only one tweet coming in from Waray, two from Bikol, and nine from Kapampangan.  In addition, no tweets were found in Pangasinense, spoken between Kapampangan and Ilocano, or Chavacano de Zamboanga, a creole based on Spanish and local languages, spoken in Mindanao.


Issues with Language Identification

One major issue with this study was that the corpora and tweets were written in quite different registers.  Filipinos use different words when they are writing tweets than when they are writing the Bible.  And even when they are using the same words, they spell those words differently.  This is especially true for minority languages, which are almost entirely spoken and rarely used in a formal literary setting.  English, on the other hand, is the main language used at school, in business and in other settings that involve a lot of writing.  Thus, Filipinos are much more likely to use "proper" English than they are "proper" Tagalog when writing tweets, and they would almost never use the "proper" spelling and vocabulary of their minority languages.  However, my corpora were based on the "proper" versions of these languages, and thus minority languages are quite underrepresented, and my study does not represent an accurate measure of, say, Waray usage versus English usage.  Only about 125,000 of the 250,000 geotagged tweets could be classified, and I believe that most of this missing half represents tweets written in "improper" Tagalog and minority languages (as well as a couple of tweets in completely different languages, like Chinese or German).

For example, to say "I have not yet been to that new SM" would be, in proper Tagalog, "Hindi pa ako nakapunta sa yung bagong SM".  However, a Filipino using social media would likely write something like "Di p aq nkpunta sa yn bago SM".  This manner of writing has been taken to an almost incomprehensible level in the cultural phenomenon of jejemon (similar to leetspeak in English) and cannot be identified as Tagalog from a corpus in "proper" Tagalog.  Slang is also quite common in spoken Tagalog and in spoken minority languages, and is impossible to identify from a "proper" corpus.  In fact, there are whole dialects of Tagalog based on slang, such as this code language used by gays. One Bikol word I learned when I was living in the Philippines was "uragon", which means strong or manly.  However, this word was not in the Bikol corpus I generated: nowhere in the Bible, in literary texts or on the Bikol Wikipedia is it used.

Finally, the end result of the language classification required a great deal of cleaning and spot checking.  This was because many of the minority languages' corpora contained words that were also in English slang, so many English tweets were misidentified as minority language tweets.  For example "haha" is a word in Waray (or at least, is a word used in the Waray bible, Waray literary works, or the Waray Wikipedia) and "omg" is a word in Kapampangan.  Since I decided that three matching words makes a tweet fall into a given language category, the following tweets were classified as Kapampangan and Waray:

"ge tawa tayo :( haha haha haha"

I had to do a lot of spot checking and re-running the classification before I ended up with results that I was satisfied with.
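The word-list approach and the three-word rule can be sketched as follows. This is a reconstruction under assumptions, not the original script: it takes "exclusive" to mean words that appear in one language's corpus and in no other, counts every matching token in the tweet (so a repeated word counts each time), and uses made-up function names.

```python
import re


def exclusive_word_lists(corpora):
    """For each language, keep only words found in no other language's corpus."""
    vocab = {lang: set(text.lower().split()) for lang, text in corpora.items()}
    exclusive = {}
    for lang, words in vocab.items():
        others = set().union(
            *(v for other, v in vocab.items() if other != lang)
        )
        exclusive[lang] = words - others
    return exclusive


def classify(tweet, exclusive, threshold=3):
    """Return every language with at least `threshold` matching tokens."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in exclusive.items()
    }
    return sorted(lang for lang, count in hits.items() if count >= threshold)
```

Under this scheme a tweet that returns both Tagalog and English would be counted as Taglish. It also shows why the "haha" problem arises: if "haha" sits in the Waray list and not in the English one, three repetitions are enough to cross the threshold.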



I think there are a number of ways I could do this project better, if I ever wanted to use this for an academic conference or paper.  A larger twitter data set could help capture more tweets from the more obscure minority languages.  However, better language classification techniques could definitely make major improvements.  I think the best way would be to build corpora from tweets themselves. To get this, I'd have to find native speakers from each minority language, give them a data set of a couple thousand tweets, and have them identify the ones that are in the minority languages they know.  Then, these training tweets would be used to classify, say, 1 million tweets.  Another possibility would be to run an "unsupervised" classification.  With this method, there is no training data set or corpus, but rather all of the tweets would be sorted into natural groups based on certain statistical features.  I have done this before with pixels in remotely sensed satellite imagery, but I am not exactly sure how to go about this for text data, or if it has ever been done before.

I would also like to use this method that I have just developed in other places outside of the Philippines.  I think Indonesia and Malaysia would be good candidates, as they are other areas with high twitter penetration, many minority languages, and Austronesian languages that cannot be classified with the n-gram method.  Another possibility would be to map the languages of tweets in various international cities with a lot of linguistic diversity, like Singapore, London, New York and Hong Kong.

Monday, June 23, 2014

Most and least studied places on earth. 

Since I do research in Mali, I've spent a lot of time on sites like Google Scholar, Scopus and Web of Science looking for articles about similar research done there. But, there aren't a lot of articles on this obscure African country. I wondered how research on Mali compares to the amount done in other African countries, or even other countries in the world. So I decided to make a map (or 3).

And I was right, Mali is one of the more under-studied places on earth, but so are most African countries.  Both per capita and per square kilometer, the DRC is the most understudied, which is a real shame.  DRC is home to some incredible natural wealth and biodiversity which is definitely deserving of more scholarly attention, as well as the most deadly conflict since World War II, which needs more attention from the world in general.  I was surprised to see so many net results about China, but I guess there is a lot to study there.

These results just show the raw number of articles returned, and don't show the subjects that research was done in, whether the research was published in books or articles, or even when the publishing was done.  All of this would be super interesting to dig into, if only Google made automated queries and web-scraping easier.

Another major caveat is that some country names will only come up in English-language queries. For example, "Germany" and "Democratic Republic of the Congo" will only return English results, whereas a country name like "Mali" will return results in most languages that academic articles are written in.

For countries whose names mean more than just the country, I tried to exclude other related terms.  So, for Guinea, I had to exclude Guinea Worm, Equatorial Guinea, Guinea Pig, Guinea Fowl, etc.

I could have simply searched for every country on Google Scholar manually. But why do something repetitive and boring on a computer when that's exactly the kind of task that computers are good at doing? You just have to know how to tell a computer what exactly you want it to do. So I wrote a script in Python using the BeautifulSoup module to query Google Scholar for countries and return the number of articles about that country. But... Google doesn't permit queries that aren't from browsers, or queries that are automated. So I had to disguise Python as a browser, and I could only get about 40 results before Google Scholar stopped returning them. All in all, it took me a few days, whereas just doing it manually would have taken maybe an hour.  But, hey, it was fun!
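The pieces of such a script can be sketched with the standard library alone. This is not the original code, and several details are assumptions: the function names, the browser string, and the idea that the result page reports its total in a line shaped like "About 1,230 results". The sketch builds a query with excluded phrases (as described above for Guinea), prepares a request with a browser-like header, and pulls the count out of a page's text. It does not send anything over the network.

```python
import re
import urllib.parse
import urllib.request


def build_query(country, exclude=()):
    """Quote the country name and subtract unrelated phrases with a minus sign."""
    parts = [f'"{country}"'] + [f'-"{term}"' for term in exclude]
    return " ".join(parts)


def build_request(query):
    """Prepare (but do not send) a Google Scholar search request."""
    url = "https://scholar.google.com/scholar?" + urllib.parse.urlencode(
        {"q": query}
    )
    # Browser-like User-Agent; the exact string is an arbitrary example
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def parse_result_count(page_text):
    """Extract the number from text such as 'About 1,230 results'."""
    match = re.search(r"([\d,]+)\s+results", page_text)
    return int(match.group(1).replace(",", "")) if match else None


guinea_query = build_query(
    "Guinea",
    exclude=["Guinea Worm", "Equatorial Guinea", "Guinea Pig", "Guinea Fowl"],
)
```

Sending the request and looping over all countries is where the rate limiting described above kicks in, so any real run would need long pauses between queries.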

Friday, June 20, 2014

Cellular Automata inspired background

Some people in geography have tried to use the mathematical concept of cellular automata (CA) to understand changes in land cover and land use. I tried to capture this idea in an Escheresque image, which I have set as a background for the blog.  It shows a gradual transition from an elementary cellular automaton into a satellite image of agriculture and finally a city on a river.  I generated the CA using NetLogo, and found a Landsat image of Minnesota online.  It was a fun foray into Photoshop.

Cellular automata are really interesting. The idea is that each cell can be black or white, and a cell's color changes depending upon the color of its neighbors.  Every single cell follows the same exact rules.  Yet, large-scale patterns seem to emerge.  One version of CA, known as Conway's Game of Life, has incredibly complex figures emerge, as seen in this epic YouTube video.

The most fascinating thing about CA is that incredibly complex patterns can emerge from very simple rules. So it's only natural that geographers trying to understand the complex patterns of land use change would turn to the relatively simple rules of CA.

In fact, the specific CA that I chose for my background, known as Rule 110 in elementary cellular automata, is possibly the simplest system in the world that is Turing-Complete.  This means that this particular CA can run computations. In fact, given sufficient cells, it could run Windows XP!  With such astounding properties, CA-like rules could explain the emergence of things like landscapes and cities, or even reality.
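Rule 110 is small enough to write out in a few lines. The background image was generated in NetLogo; the Python sketch below is only an illustration of the same rule, and it assumes that cells beyond the edges of the row are always white (0). The rule number encodes the output for each of the eight possible left/center/right neighborhoods, one bit per neighborhood.

```python
def step(cells, rule=110):
    """Apply an elementary CA rule once; cells beyond the edges count as 0."""
    padded = [0] + cells + [0]
    return [
        # Read left, center, right as a 3-bit number and look up that bit
        # of the rule number
        (rule >> (padded[i - 1] << 2 | padded[i] << 1 | padded[i + 1])) & 1
        for i in range(1, len(padded) - 1)
    ]


def run(width=32, generations=16):
    """Start from a single black cell at the right edge and iterate."""
    row = [0] * (width - 1) + [1]
    rows = [row]
    for _ in range(generations - 1):
        row = step(row)
        rows.append(row)
    return rows


for row in run():
    print("".join("#" if cell else "." for cell in row))
```

Printing the rows one under another shows the characteristic triangles growing leftward from the starting cell.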

Monday, January 20, 2014

Oh yea, and here's one from a grant I'm working on. I want to look at why Yanfolila has lower rates of anemia. Could it be because they don't grow cotton, and (seem to) have healthier, thicker forests? Give me some money, and I'll go find out! I served in Peace Corps in the extremely anemic region of Kolondieba.