
MIT Course Majors Male Female Ratio, Fall 2011 (scraping with google docs)

Photo: as seen at Stata

When I saw this in Stata a few days ago, I legitimately wondered for a few seconds which way it was color-coded, until I saw a note (cropped out of this pic) that made me smile. At MIT, and hopefully in the world at large, the imperfect gender ratios are really not that noticeable unless you’re looking for them. MIT is wonderful (I say this now because I have no finals this term, just project classes that I loved even if I didn’t do as well as I’d’ve liked).

Anyway, there were no citations! So I decided to go off and investigate on my own and treat it as a data visualization exercise (hi, upcoming IAP / startlabs / c2c / metrify).

EDIT: Derp! I data-failed. You can see civil/environmental engineering (course 1 — see http://web.mit.edu/catalog/subjects.html if you’re confused by the numbers) is definitely off. : / messed up copying and pasting somewhere. Will redo this for realz one day (they have data for multiple years, I want to clean it up automatically).

sorted by %
sorted by course number

~~

http://www.quora.com/MIT-1/What-is-the-gender-ratio-for-each-of-the-majors-at-MIT
data source
Here’s the data I used (I should just email them for the CSVs, but I thought this would be a simple data scraping exercise):
http://web.mit.edu/registrar/stats/gender/index.html
http://web.mit.edu/registrar/stats/yrpts/index.html
What tripped me up is that the women-only listing of majors and the everyone listing are ever so slightly off from each other, so I have to decide how many distinctions I want to keep. Do I really care about 2 vs 2A? Not really. If I care about 6-1 vs 6-2 vs 6-3, do I just drop 6-7?
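One way to make those decisions explicit is to write the merge rules down as data and let a script do the folding. This is only a rough sketch in Python (not the Google Apps Script I actually used): the `MERGE` table, the `merge_courses` helper, and all the counts are made up for illustration, and the real listings would need their own rules.

```python
from collections import defaultdict

# Hypothetical merge rules (my assumption, not the registrar's): fold these
# listed majors into a parent course. Anything not listed stays as-is.
MERGE = {"2A": "2", "6-1": "6", "6-2": "6", "6-3": "6", "6-7": "6"}


def merge_courses(women, total, merge=MERGE):
    """Return ({course: (women, total, pct_women)}, [courses only in `women`])."""
    w, t = defaultdict(int), defaultdict(int)
    for code, n in women.items():
        w[merge.get(code, code)] += n
    for code, n in total.items():
        t[merge.get(code, code)] += n
    # Courses in the women listing but missing from the everyone listing
    # are the "slightly off" rows that need a manual look.
    unmatched = sorted(set(w) - set(t))
    merged = {}
    for course, n in t.items():
        pct = round(100.0 * w.get(course, 0) / n, 1) if n else None
        merged[course] = (w.get(course, 0), n, pct)
    return merged, unmatched


# Made-up counts, purely for illustration.
women = {"2": 10, "2A": 5, "6-1": 3, "CMS": 4}
total = {"2": 30, "2A": 10, "6-1": 12, "6-2": 8}
merged, unmatched = merge_courses(women, total)
print(merged)
print(unmatched)
```

Keeping the rules in one dictionary means that changing my mind about 6-7 is a one-line edit, and the `unmatched` list flags any major that shows up in one listing but not the other instead of silently dropping it.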

assorted google spreadsheets of data from those sources
cleaned data for charts
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdHFNZU5mbEdYU2VIOE1BUTRLZHJJVkE

raw data sorted for female / undergrads
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdFg0Vk14RW9JQjJtSFFsMXJBZ2hnSEE

raw scraped html table data
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdFYzNUdEN2NKM2lGbHV6OVN4YkJNTmc

where I tried to automagically clean up the data (see script editor) and failed:
https://docs.google.com/spreadsheet/ccc?key=0Atr0HZeoMbmgdFRkSXp5SW9teEJiT0t5U2FJeTBUVnc&hl=en_US#gid=0

the google apps code I used for that last gdoc



Google Docs Scripting Tutorials
An excellent, free, step-by-step tutorial on ImportXML (similar to the importHTML that I used), which is also how I heard of XPath (which I didn’t end up needing, thanks to importHTML magic): http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
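For anyone who wants the same "grab the Nth table off a page" trick outside of a spreadsheet, here is a rough Python approximation of what importHTML does with a table. It is a sketch using only the standard library; `TableGrabber` and `import_html_table` are names I made up, and it assumes well-formed tables with closed cell tags and no nesting, which real pages do not always give you.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index=1):
    """Return the index-th table (1-based, like importHTML) as rows of strings."""
    grabber = TableGrabber()
    grabber.feed(html)
    return grabber.tables[index - 1]


# Tiny made-up page standing in for a real download.
page = (
    "<table><tr><th>Course</th><th>Women</th></tr>"
    "<tr><td>1</td><td> 40 </td></tr></table>"
    "<table><tr><td>x</td></tr></table>"
)
rows = import_html_table(page, 1)
print(rows)
```

To run it against a live page, the HTML string could come from something like `urllib.request.urlopen(url).read().decode()`, at which point the result is a plain list of rows that is easy to dump to CSV.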
other useful tutorials I referred to
http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
http://blog.fastfedora.com/2011/08/how-to-use-open-apis.html *shudders* speaking of government apis, I really should be less terrible and fix my github codeforamerica repos which I left half-finished. AUGH. I FAIL. This is why I haven’t applied to any jobs yet. Upcoming: post on why you shouldn’t hire me (with the caveat that I feel okay listing these because these are past mistakes that I don’t intend to make again).
http://blog.ouseful.info/2010/02/15/creating-a-winter-olympics-2010-medal-map-in-google-spreadsheets/
eeee shiny http://pipes.yahoo.com/pipes/
I also attempted to use google fusion tables but did not have a good experience.
http://code.google.com/googleapps/appsscript/articles/writing_spreadsheet_data.html
http://code.google.com/googleapps/appsscript/service_spreadsheet.html
To investigate: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning
