출처 : 금융 데이터 분석을 위한 파이썬 판다스
금융 데이터 분석을 위한 파이썬 판다스
최근 인공지능 AI(Artificial Intelligence)이 보급화되면서 방대한 양의 데이터를 처리하는 방식이 중요해지기 시작했습니다. 판다스(Pandas)는 오픈 소 ...
wikidocs.net
문자열 --> TimeStamp
pd.to_datetime(datetime문자열)
import pandas as pd
ts = pd.to_datetime("2021-01-02")
print(type(ts)) # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(ts) # 2021-01-02 00:00:00
# 문자열 time -> timestamp
ts = pd.to_datetime("2021-01-02 09:00:00")
ts1 = pd.to_datetime("20210102 090000")
ts2 = pd.to_datetime("01/02/21") # 미국식 : format = '%m/%d/%y'
ts3 = pd.to_datetime("02/01/21", format="%d/%m/%y") # 4자리 연 = %Y / %m - 2자리 월 / %d - 2자리 일
print(ts) # 2021-01-02 09:00:00
print(ts1) # 2021-01-02 09:00:00
print(ts2) # 2021-01-02 00:00:00
print(ts3) # 2021-01-02 00:00:00
# utc -> timestamp
ts4 = pd.to_datetime(1628899200, unit='s') # utc
print(ts4) # 2021-08-14 00:00:00
ts = pd.to_datetime("2021-08-14 091130")
print(ts.year) # 2021
print(ts.month) # 8
print(ts.day) # 14
print(ts.hour) # 9
print(ts.minute) # 11
print(ts.second) # 30
# 요일값 반환 : 월요일 - 0
print(ts.weekday()) # 5
TimeStamp/ datetime -> datetime문자열
TimeStamp/ datetime개체.strftime(포맷형식)
import pandas as pd
import datetime
ts = pd.to_datetime("2021-08-14 091130")
print(ts.strftime("%Y-%m-%d")) # 2021-08-14
print(type(ts.strftime("%Y-%m-%d"))) # <class 'str'>
print(ts.strftime("%D")) # 08/14/21
dt1=datetime.datetime.today()
s=dt1.strftime("%Y-%m-%d")
print(s) # 2022-04-08
# Timedelta(days=,hours=,minutes=,seconds=)
import pandas as pd
import datetime
ts = pd.to_datetime("2022-04-09")
diff = pd.Timedelta(days=100, hours=2, minutes=30, seconds=30 ) # 몇일 후 계산 위해
print(diff)
print(ts + diff)
candidates = ["2021-01-01", "2021-01-02", "2021-01-03"]
idx = pd.to_datetime(candidates)
print(idx) # DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03'], dtype='datetime64[ns]', freq=None)
print(idx[0]) # 2021-01-01 00:00:00
print(idx[0:2]) # DatetimeIndex(['2021-01-01', '2021-01-02'], dtype='datetime64[ns]', freq=None)
print(len(idx)) # 3
print(type(idx)) # <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
print(type(idx[0])) # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(idx.year) # Int64Index([2021, 2021, 2021], dtype='int64')
print(idx.month) # Int64Index([1, 1, 1], dtype='int64')
print(idx.day) # Int64Index([1, 2, 3], dtype='int64')
data = [ {'시가': 100, '고가': 110, '저가': 90, '종가': 105}, {'시가': 100, '고가': 112, '저가': 80, '종가': 95}, {'시가': 99, '고가': 115, '저가': 70, '종가': 85}, {'시가': 70, '고가': 80, '저가': 60, '종가': 75}, ] df = pd.DataFrame(data, index=['20200615', '20200616', '20200717', '20200718']) print(df) |
시가 고가 저가 종가 20200615 100 110 90 105 20200616 100 112 80 95 20200717 99 115 70 85 20200718 70 80 60 75 |
cond = df.index.str[:6] == "202006" print(df.loc[ cond ]) |
시가 고가 저가 종가 20200615 100 110 90 105 20200616 100 112 80 95 |
df.index = pd.to_datetime(df.index) print(df) |
시가 고가 저가 종가 2020-06-15 100 110 90 105 2020-06-16 100 112 80 95 2020-07-17 99 115 70 85 2020-07-18 70 80 60 75 |
print(type(df.index)) # <class 'pandas.core.indexes.datetimes.DatetimeIndex'> print(type(df.index[0])) # <class 'pandas._libs.tslibs.timestamps.Timestamp'> # DatetimeIndex에서는 슬라이싱date 필터링 가능 print(df.loc[ "2020-06" ]) |
시가 고가 저가 종가 2020-06-15 100 110 90 105 2020-06-16 100 112 80 95 |
df['date'] = df.index print(type(df['date'])) # <class 'pandas.core.series.series'=""> print(type(df['date'].iloc[0])) # <class 'str'=""> print(df.date) print(df['date']) |
20200615 20200615 20200616 20200616 20200717 20200717 20200718 20200718 Name: date, dtype: object 20200615 20200615 20200616 20200616 20200717 20200717 20200718 20200718 Name: date, dtype: object |
print(df.index.year) # Int64Index([2020, 2020, 2020, 2020], dtype='int64') | |
print(df["date"].dt.year) # df["date"].year - 에러발생 : Series는 year속성x |
2020-06-15 2020 2020-06-16 2020 2020-07-17 2020 2020-07-18 2020 Name: date, dtype: int64 |
df = pd.read_excel("DB/ss_ex_1.xlsx" , index_col=0,engine="openpyxl")
df.head(3)
print(df)
종가 대비 등락률 ... 거래대금 시가총액 상장주식수
일자 ...
2021/08/13 74400.0 -2600.0 -3.38 ... 4.575268e+12 4.441518e+14 5.969783e+09
2021/08/12 77000.0 -1500.0 -1.91 ... 3.276635e+12 4.596733e+14 5.969783e+09
2021/08/11 78500.0 -1700.0 -2.12 ... 2.389977e+12 4.686279e+14 5.969783e+09
2021/08/10 80200.0 -1300.0 -1.60 ... 1.643108e+12 4.787766e+14 5.969783e+09
2021/08/09 81500.0 0.0 0.00 ... 1.267668e+12 4.865373e+14 5.969783e+09
... ... ... ... ... ... ... ...
2021/02/19 82600.0 500.0 0.61 ... 2.121275e+12 4.931040e+14 5.969783e+09
2021/02/18 82100.0 -1100.0 -1.32 ... 1.762034e+12 4.901191e+14 5.969783e+09
2021/02/17 83200.0 -1700.0 -2.00 ... 1.526409e+12 4.966859e+14 5.969783e+09
2021/02/16 84900.0 700.0 0.83 ... 1.740792e+12 5.068345e+14 5.969783e+09
2021/02/15 84200.0 2600.0 3.19 ... 1.978337e+12 5.026557e+14 5.969783e+09
[127 rows x 10 columns]
df.index = pd.to_datetime(df.index)
df = df.sort_index()
print(df)
종가 대비 등락률 ... 거래대금 시가총액 상장주식수
일자 ...
2021-02-15 84200.0 2600.0 3.19 ... 1.978337e+12 5.026557e+14 5.969783e+09
2021-02-16 84900.0 700.0 0.83 ... 1.740792e+12 5.068345e+14 5.969783e+09
2021-02-17 83200.0 -1700.0 -2.00 ... 1.526409e+12 4.966859e+14 5.969783e+09
2021-02-18 82100.0 -1100.0 -1.32 ... 1.762034e+12 4.901191e+14 5.969783e+09
2021-02-19 82600.0 500.0 0.61 ... 2.121275e+12 4.931040e+14 5.969783e+09
... ... ... ... ... ... ... ...
2021-08-09 81500.0 0.0 0.00 ... 1.267668e+12 4.865373e+14 5.969783e+09
2021-08-10 80200.0 -1300.0 -1.60 ... 1.643108e+12 4.787766e+14 5.969783e+09
2021-08-11 78500.0 -1700.0 -2.12 ... 2.389977e+12 4.686279e+14 5.969783e+09
2021-08-12 77000.0 -1500.0 -1.91 ... 3.276635e+12 4.596733e+14 5.969783e+09
2021-08-13 74400.0 -2600.0 -3.38 ... 4.575268e+12 4.441518e+14 5.969783e+09
[127 rows x 10 columns]
df = pd.read_excel("DB/data_5402_20220403.xlsx" , parse_dates=['일자']) # datetimeindex로 읽기
df = df.sort_values('일자')
print(df)
일자 종가 대비 ... 거래대금 시가총액 상장주식수
493 2020-04-01 45800.0 -1950.0 ... 1.282483e+12 2.734160e+14 5.969783e+09
492 2020-04-02 46800.0 1000.0 ... 9.978310e+11 2.793858e+14 5.969783e+09
491 2020-04-03 47000.0 200.0 ... 1.074180e+12 2.805798e+14 5.969783e+09
490 2020-04-06 48700.0 1700.0 ... 1.123254e+12 2.907284e+14 5.969783e+09
489 2020-04-07 49600.0 900.0 ... 1.561333e+12 2.961012e+14 5.969783e+09
.. ... ... ... ... ... ... ...
4 2022-03-28 69700.0 -100.0 ... 8.781722e+11 4.160938e+14 5.969783e+09
3 2022-03-29 70200.0 500.0 ... 9.585895e+11 4.190787e+14 5.969783e+09
2 2022-03-30 69900.0 -300.0 ... 8.875143e+11 4.172878e+14 5.969783e+09
1 2022-03-31 69600.0 -300.0 ... 8.732126e+11 4.154969e+14 5.969783e+09
0 2022-04-01 69100.0 -500.0 ... 1.100422e+12 4.125120e+14 5.969783e+09
[494 rows x 11 columns]
print(df['일자'].dtype) # datetime64[ns]
print(type(df['일자'].iloc[0])) # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(df['일자'].dt.quarter) |
493 2 492 2 491 2 490 2 489 2 .. 4 1 3 1 2 1 1 1 0 2 Name: 일자, Length: 494, dtype: int64 |
df = df[['일자', '시가', '저가', '고가', '종가']].copy() print(df) |
일자 시가 저가 고가 종가 493 2020-04-01 47450.0 45800.0 47900.0 45800.0 492 2020-04-02 46200.0 45350.0 46850.0 46800.0 491 2020-04-03 47400.0 46550.0 47600.0 47000.0 490 2020-04-06 47500.0 47250.0 48800.0 48700.0 489 2020-04-07 49650.0 49000.0 50200.0 49600.0 .. ... ... ... ... ... 4 2022-03-28 69500.0 69200.0 69900.0 69700.0 3 2022-03-29 70000.0 69800.0 70300.0 70200.0 2 2022-03-30 70300.0 69800.0 70500.0 69900.0 1 2022-03-31 69900.0 69600.0 70200.0 69600.0 0 2022-04-01 69500.0 69000.0 69500.0 69100.0 [494 rows x 5 columns] |
df['year'] = df['일자'].dt.year df['month'] = df['일자'].dt.month print(df.head()) |
일자 시가 저가 고가 종가 year month 493 2020-04-01 47450.0 45800.0 47900.0 45800.0 2020 4 492 2020-04-02 46200.0 45350.0 46850.0 46800.0 2020 4 491 2020-04-03 47400.0 46550.0 47600.0 47000.0 2020 4 490 2020-04-06 47500.0 47250.0 48800.0 48700.0 2020 4 489 2020-04-07 49650.0 49000.0 50200.0 49600.0 2020 4 |
gb = df.groupby(['year', 'month']) gb1=gb.get_group( (2021, 2) ).head( ) print(gb1) |
일자 시가 저가 고가 종가 year month 287 2021-02-01 81700.0 81000.0 83400.0 83000.0 2021 2 286 2021-02-02 84100.0 83700.0 86400.0 84400.0 2021 2 285 2021-02-03 84800.0 83400.0 85400.0 84600.0 2021 2 284 2021-02-04 83500.0 82100.0 83800.0 82500.0 2021 2 283 2021-02-05 83100.0 82500.0 84000.0 83500.0 2021 2 |
how = { "시가": 'first', "저가": min, "고가": max, "종가": 'last' } df1=gb.agg(how) print(df1) |
시가 저가 고가 종가 year month 2020 4 47450.0 45350.0 52000.0 50000.0 5 48900.0 47200.0 51200.0 50700.0 6 50800.0 49900.0 57000.0 52800.0 7 53400.0 52100.0 60400.0 57900.0 8 57800.0 54000.0 59900.0 54000.0 9 54100.0 54100.0 61300.0 58200.0 10 57500.0 56600.0 61500.0 56600.0 11 56400.0 56000.0 69500.0 66700.0 12 67100.0 67100.0 81300.0 81000.0 2021 1 81000.0 80200.0 96800.0 82000.0 2 81700.0 81000.0 86400.0 82500.0 3 85100.0 80600.0 85300.0 81400.0 4 82500.0 81500.0 86200.0 81500.0 5 81000.0 78400.0 83500.0 80500.0 6 80500.0 79600.0 83000.0 80700.0 7 80500.0 78100.0 81300.0 78500.0 8 79200.0 72500.0 83300.0 76700.0 9 76700.0 73700.0 77800.0 74100.0 10 73900.0 68300.0 74000.0 69800.0 11 70200.0 69600.0 76200.0 71300.0 12 72000.0 71600.0 80800.0 78300.0 2022 1 79400.0 71200.0 79800.0 73300.0 2 74900.0 71000.0 75800.0 72100.0 3 72300.0 68700.0 73100.0 69600.0 4 69500.0 69000.0 69500.0 69100.0 |
# groupby가 실행되는 규칙을 지정 # 3m - 3분할 월 # w - 주 # d - 일 df1 = df.groupby( pd.Grouper(key='일자', freq='m') ).agg(how) print(df1) |
시가 저가 고가 종가 일자 2020-04-30 47450.0 45350.0 52000.0 50000.0 2020-05-31 48900.0 47200.0 51200.0 50700.0 2020-06-30 50800.0 49900.0 57000.0 52800.0 2020-07-31 53400.0 52100.0 60400.0 57900.0 2020-08-31 57800.0 54000.0 59900.0 54000.0 2020-09-30 54100.0 54100.0 61300.0 58200.0 2020-10-31 57500.0 56600.0 61500.0 56600.0 2020-11-30 56400.0 56000.0 69500.0 66700.0 2020-12-31 67100.0 67100.0 81300.0 81000.0 2021-01-31 81000.0 80200.0 96800.0 82000.0 2021-02-28 81700.0 81000.0 86400.0 82500.0 2021-03-31 85100.0 80600.0 85300.0 81400.0 2021-04-30 82500.0 81500.0 86200.0 81500.0 2021-05-31 81000.0 78400.0 83500.0 80500.0 2021-06-30 80500.0 79600.0 83000.0 80700.0 2021-07-31 80500.0 78100.0 81300.0 78500.0 2021-08-31 79200.0 72500.0 83300.0 76700.0 2021-09-30 76700.0 73700.0 77800.0 74100.0 2021-10-31 73900.0 68300.0 74000.0 69800.0 2021-11-30 70200.0 69600.0 76200.0 71300.0 2021-12-31 72000.0 71600.0 80800.0 78300.0 2022-01-31 79400.0 71200.0 79800.0 73300.0 2022-02-28 74900.0 71000.0 75800.0 72100.0 2022-03-31 72300.0 68700.0 73100.0 69600.0 2022-04-30 69500.0 69000.0 69500.0 69100.0 |
'DataFrame' 카테고리의 다른 글
데이터 샘플링 (0) | 2022.04.09 |
---|---|
컬럼 shift (0) | 2022.04.09 |
판다스 데이터프레임 (데이터 읽기/ 저장하기) (0) | 2022.04.07 |
판다스 데이터프레임 (고급기능2) (0) | 2022.04.07 |
판다스 데이터프레임 (고급기능1) (0) | 2022.04.07 |
댓글