[Python] Visualizing Titanic Survival Rates


taehee94 2022. 3. 8. 14:59

The Titanic dataset can be downloaded from Kaggle at the link below.

https://www.kaggle.com/c/titanic/overview


To predict survival on the Titanic, we need the passenger records from that time.

If pandas, matplotlib, or seaborn cannot be imported, install them with pip first.

pip install pandas
pip install matplotlib
pip install seaborn

Load the data files and store them in DataFrames.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

titanic_df = pd.read_csv(r'./titanic/train.csv')
test_df = pd.read_csv(r'./titanic/test.csv')
gender_df = pd.read_csv(r'./titanic/gender_submission.csv')

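Select only the columns we need and display them.

# Check the contents of train.csv, keeping only the columns we need
# (titanic_df is already a DataFrame, so no conversion is required)
titanic_df[['Age','Pclass','Sex','SibSp','Survived']].head(5)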

	Age	Pclass	Sex	SibSp	Survived
0	22.0	3	male	1	0
1	38.0	1	female	1	1
2	26.0	3	female	0	1
3	35.0	1	female	1	1
4	35.0	3	male	0	0

info() prints a summary of the columns in the Titanic data.

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

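The output shows that some columns are incomplete: Age is recorded for only 714 of the 891 passengers, Cabin for only 204, and Embarked for 889.

Next, let's visualize the survival rate broken down by several criteria.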
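The missing values can also be counted directly instead of being read off the info() output. The following is a minimal sketch, assuming a tiny made-up DataFrame (`sample_df` is hypothetical and stands in for `titanic_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train.csv (values are made up for illustration)
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", None],
})

# Number of missing values per column
missing_counts = sample_df.isnull().sum()

# Share of missing values per column, as a percentage
missing_pct = (sample_df.isnull().mean() * 100).round(1)

print(pd.DataFrame({"missing": missing_counts, "percent": missing_pct}))
```

Replacing `sample_df` with `titanic_df` should give the per-column counts for the real data.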

1. Overall survival rate

# Overall survival rate

# Pie chart
plt.figure(figsize=(5,5))
titanic_df.Survived.value_counts().plot(kind = 'pie')
plt.show()

In the chart, 1 marks the survivors and 0 marks those who died.
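The pie chart shows the proportions only visually. As a rough sketch, the same proportions can be computed as numbers and given readable labels; the `survived` Series here is a made-up sample, not the real column:

```python
import pandas as pd

# Hypothetical sample of the Survived column (0 = died, 1 = survived)
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# normalize=True returns proportions instead of raw counts
rates = survived.value_counts(normalize=True).sort_index()

# Replace the 0/1 codes with readable labels
rates.index = rates.index.map({0: "Dead", 1: "Survived"})

print(rates)
```

With the real data, `titanic_df['Survived']` would take the place of the sample. Passing `autopct='%1.1f%%'` to `.plot(kind='pie')` prints the percentage on each slice.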

2. Survival rate by Pclass (ticket class)

# Group by Pclass (ticket class) and take the mean of Survived, i.e. the survival rate per class
titanic_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

fig, ax = plt.subplots(figsize=(6,6))
 
sns.countplot(data=titanic_df, x='Pclass', hue='Survived', ax=ax)

labels = ['Dead', 'Survived']
ax.legend(labels=labels)

plt.show()

In 1st class, survivors outnumber the dead; in 2nd class the two counts are similar; in 3rd class the dead outnumber the survivors.
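The count plot compares raw head counts, and the classes differ in size. One way to compare them on equal footing is a row-normalized cross-tabulation. This is only a sketch on made-up data (`sample_df` is hypothetical):

```python
import pandas as pd

# Hypothetical sample; the real data would come from titanic_df
sample_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Rows: Pclass, columns: Survived (0/1).
# normalize='index' divides each row by its total, so every row sums to 1.
class_table = pd.crosstab(sample_df["Pclass"], sample_df["Survived"], normalize="index")
class_table.columns = ["Dead", "Survived"]

print(class_table.round(2))
```

Calling `class_table.plot(kind='bar', stacked=True)` would draw these shares as stacked bars, one bar per class.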

3. Survival rate by sex

# Survival rate by sex

titanic_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

fig, ax = plt.subplots(figsize=(6,6))
sns.countplot(data=titanic_df, x='Sex', hue='Survived', ax=ax)
 
labels = ['Dead', 'Survived']
ax.legend(labels=labels)

plt.show()

The plot shows that more men than women died.
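To back the plot with numbers, the group size, survivor count, death count, and survival rate can be gathered in one table. A minimal sketch on a made-up sample (`sample_df` and the output column names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample; the real data would come from titanic_df
sample_df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Survived is 0/1, so count = group size, sum = survivors, mean = survival rate
summary = sample_df.groupby("Sex")["Survived"].agg(
    total="count", survivors="sum", rate="mean"
)
summary["deaths"] = summary["total"] - summary["survivors"]

print(summary)
```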

4. Survival rate by number of accompanying passengers (SibSp)

# Survival rate by number of accompanying passengers (SibSp)

def bar_chart(feature):
    survived = titanic_df[titanic_df['Survived']==1][feature].value_counts()
    dead = titanic_df[titanic_df['Survived']==0][feature].value_counts()
    
    df = pd.DataFrame([survived,dead])
    df.index = ['Survived','Dead']
    df.plot(kind='bar',stacked=True, figsize=(10,5))
    
bar_chart('SibSp')
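It can help to look at the table that bar_chart builds before it is plotted. When a SibSp value appears in only one of the two groups, the combined table gets a NaN in the other row. The sketch below uses made-up data (`sample_df` is hypothetical) and fills those gaps with 0:

```python
import pandas as pd

# Hypothetical sample: nobody with SibSp == 8 survived, nobody with SibSp == 1 died
sample_df = pd.DataFrame({
    "SibSp":    [0, 0, 1, 1, 0, 8, 8],
    "Survived": [1, 0, 1, 1, 0, 0, 0],
})

# Same construction as in bar_chart
survived = sample_df[sample_df["Survived"] == 1]["SibSp"].value_counts()
dead = sample_df[sample_df["Survived"] == 0]["SibSp"].value_counts()

counts = pd.DataFrame([survived, dead])
counts.index = ["Survived", "Dead"]

# Values missing from one group show up as NaN; treat them as zero passengers
counts = counts.fillna(0).astype(int).sort_index(axis=1)

print(counts)
```

Sorting the columns keeps the SibSp values in ascending order in the table.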

5. Survival rate by age

# Age distribution of Titanic passengers
fig, ax = plt.subplots(figsize=(10,6))
 
sns.histplot(titanic_df['Age'], bins=25, ax=ax)
 
plt.show()

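# Cumulative survival rate by age
age_survival = []

# For each age threshold i, compute the survival rate of all passengers younger than i:
# survivors under age i / passengers under age i (rows with a missing Age are excluded)
ages = range(1, 80)
for i in ages:
    age_survival.append(titanic_df[titanic_df['Age'] < i]['Survived'].mean())

plt.figure(figsize=(7,7))
# Pass the thresholds as x values so the axis shows ages rather than list indices
plt.plot(ages, age_survival)
plt.title('Survival rate of passengers younger than each age')
plt.ylabel('Survival rate')
plt.xlabel('Age')

plt.show()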
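The loop above computes a cumulative rate: each point covers every passenger younger than the given age. To see the rate within separate age ranges, one option is to bin the ages with `pd.cut`. This is a sketch under assumptions: `sample_df` is made up and the bin edges are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, including one missing Age
sample_df = pd.DataFrame({
    "Age":      [4.0, 8.0, 15.0, 22.0, 28.0, 35.0, np.nan, 62.0],
    "Survived": [1,   1,   0,    0,    1,    0,    1,      0],
})

# Arbitrary bin edges; right=False makes the intervals [0, 10), [10, 20), ...
bins = [0, 10, 20, 30, 40, 100]
age_group = pd.cut(sample_df["Age"], bins=bins, right=False)

# Rows with a missing Age fall into no bin and are left out of the grouping
by_group = sample_df.groupby(age_group, observed=False)["Survived"].agg(
    count="count", rate="mean"
)

print(by_group)
```

Keeping the `count` column next to `rate` shows how many passengers each rate is based on, which matters for sparsely populated age ranges.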