almost 3 years ago

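The US Social Security Administration (SSA) publishes open data on baby names for births from 1880 to 2010, in files named yob1880.txt through yob2010.txt.
This post uses pandas DataFrames to plot and analyze that data.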

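Conclusions first:

  1. US births rose year over year, with the growth slowing in recent years
  2. The most common, "everybody has it" names peaked between the 1960s and the 1980s
  3. Name diversity has increased over the past decade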

Number of births in the United States:

# coding: utf-8

# # US Baby Names 1880-2010


# In[2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# run_line_magic replaces the deprecated get_ipython().magic(...) call
get_ipython().run_line_magic('matplotlib', 'inline')

# In[3]:

names1880 = pd.read_csv('names/yob1880.txt',names=['name','sex','births'])

# In[4]:

names1880[:5]

# Add all files into a single dataframe

# In[5]:


years = range(1880,2011)
pieces = []
columns = ['name','sex','births']

for year in years:
    path = 'names/yob{}.txt'.format(year)
    frame = pd.read_csv(path,names=columns)
    
    frame['year'] = year
    pieces.append(frame)
    
## Concatenate everything into a single Dataframe

names = pd.concat(pieces,ignore_index=True)

# In[6]:


names[:5]

# We can aggregate the data at the year and sex level using `groupby` or `pivot_table`


# In[7]:


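# Keyword arguments make the roles explicit; aggfunc='sum' avoids the warning
# that recent pandas versions raise for the builtin sum
total_births = names.pivot_table(values='births', index='year', columns='sex', aggfunc='sum')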
total_births.head(5)
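
The comment above mentions that either `groupby` or `pivot_table` can produce this aggregation. A minimal sketch of the two routes side by side, assuming a small made-up frame (`toy` and its values are hypothetical, not taken from the SSA files):

```python
import pandas as pd

# Hypothetical toy data in the same column layout as the `names` frame
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex':    ['F', 'F', 'M', 'F', 'M'],
    'births': [7, 3, 9, 6, 8],
    'year':   [1880, 1880, 1880, 1881, 1881],
})

# Route 1: pivot_table with an explicit sum
by_pivot = toy.pivot_table(values='births', index='year', columns='sex', aggfunc='sum')

# Route 2: group on both keys, sum, then move 'sex' from the index to the columns
by_groupby = toy.groupby(['year', 'sex'])['births'].sum().unstack('sex')

print(by_pivot)
print(by_groupby)
```

Both tables have one row per year and one column per sex, so which one to use is mostly a matter of taste.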

# In[8]:

total_births.plot(title='Total births by sex and year')
plt.show()

How the number of births for selected names (John, Harry, Mary, Marilyn) changes over time

# ## Insert a column in DataFrame


# In[9]:


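# prop = share of each name within its (year, sex) group.
# transform('sum') broadcasts each group's total back onto its rows, so the
# index of `names` stays unchanged. In recent pandas versions groupby.apply adds
# the group keys to the index, which makes the later groupby calls ambiguous.
group_totals = names.groupby(['year','sex'])['births'].transform('sum')
names['prop'] = names['births'].astype(float) / group_totals
names[:5]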


# Check that prop sums to one within each (year, sex) group


# In[10]:


np.allclose(names.groupby(['year','sex']).prop.sum(),1)
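
To make the `prop` column and the sanity check concrete, here is a small hand-checkable sketch. It assumes a made-up frame (`toy` is hypothetical) and uses `groupby(...).transform('sum')` as one possible way to compute the within-group share:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two girls' names and two boys' names in one year
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'William'],
    'sex':    ['F', 'F', 'M', 'M'],
    'births': [6, 2, 3, 1],
    'year':   [1880, 1880, 1880, 1880],
})

# Total births of each (year, sex) group, repeated on every row of that group
group_total = toy.groupby(['year', 'sex'])['births'].transform('sum')
toy['prop'] = toy['births'] / group_total

# Within each group the shares should add up to 1
sums = toy.groupby(['year', 'sex'])['prop'].sum()
print(toy)
print(np.allclose(sums, 1))
```

Here Mary accounts for 6 of the 8 female births and John for 3 of the 4 male births, so both get a share of 0.75.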


# ## Extract top 1000 names for each year


# In[11]:


def get_top1000(group):
    # sort_index(by=...) is the old API; current pandas uses sort_values
    return group.sort_values(by='births',ascending=False)[:1000]

grouped = names.groupby(['year','sex'])
top1000 = grouped.apply(get_top1000)
top1000.index = np.arange(len(top1000))


# In[12]:


top1000.tail()
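
The "top N per group" step can also be written without `apply`. This is a sketch of an alternative, not the approach used above, and it assumes a made-up frame (`toy` is hypothetical) with N = 2 so the result is easy to check by eye:

```python
import pandas as pd

# Hypothetical toy data: three names per sex in a single year
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'Emma', 'John', 'William', 'James'],
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'births': [7, 3, 5, 9, 4, 6],
    'year':   [1880] * 6,
})

top2 = (
    toy.sort_values('births', ascending=False)   # most popular names first
       .groupby(['year', 'sex'])
       .head(2)                                  # keep the first 2 rows of each group
       .sort_values(['year', 'sex', 'births'], ascending=[True, True, False])
       .reset_index(drop=True)                   # plain 0..n-1 index, like top1000
)
print(top2)
```

`groupby(...).head(n)` keeps the original columns, so `year` and `sex` remain available for the later filtering and pivoting steps.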


# ## Analyzing Naming Trend


# In[13]:


boys = top1000[top1000.sex =='M']
girls = top1000[top1000.sex=='F']


# In[14]:


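# One column per name, one row per year; aggfunc='sum' adds up boys and girls sharing a name
total_births = top1000.pivot_table(values='births',columns='name',index='year',aggfunc='sum')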
total_births[['John','Mary']][0:5]


# In[15]:


subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False,
            title="Number of births per year")

Share of all births covered by the top 1000 names

# ## Measuring the increase in naming diversity


# In[19]:


table = top1000.pivot_table(values='prop',index='year',columns='sex',aggfunc='sum')
table.plot(title='Sum of top1000.prop by year and sex',xticks=range(1880,2020,10))

Number of distinct names needed to cover the top 50% of births

# In[20]:


df = boys[boys.year==2010]
prop_cumsum=df.sort_values(by='prop',ascending=False).prop.cumsum()


# In[21]:


prop_cumsum[:10]


# In[22]:


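# Zero-based position where the cumulative share first reaches 0.5;
# add 1 to turn it into a count of names
prop_cumsum.values.searchsorted(0.5)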


# In[23]:


df =boys[boys.year==1900]


# In[24]:


in1900 = df.sort_values(by='prop',ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5)+1
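
Why the `+1`? `searchsorted` returns a zero-based position in the cumulative-sum array, while the question being asked is "how many names does it take". A minimal sketch with made-up shares (the `props` values are hypothetical):

```python
import numpy as np

# Hypothetical name shares, already sorted from most to least popular
props = np.array([0.30, 0.15, 0.10, 0.05])

# Running total of the shares: roughly 0.30, 0.45, 0.55, 0.60
cumulative = props.cumsum()

# First position whose running total is at least 0.5 (zero-based)
pos = cumulative.searchsorted(0.5)

# Positions start at 0, so the number of names needed is pos + 1
count = pos + 1
print(pos, count)
```

With these numbers the running total first passes 0.5 at the third name, which sits at position 2, so three names are needed.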


# In[25]:


def get_quantile_count(group,q=0.5):
    group=group.sort_values(by='prop',ascending=False)
    return group.prop.cumsum().values.searchsorted(q)+1

diversity = top1000.groupby(['year','sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')


# In[27]:


diversity.head()


# In[28]:


diversity.plot(title='Number of popular names in top 50%')
