How to load a csv file?

This notebook shows how to load a csv file as a pandas.DataFrame.

The data is hosted on figshare (see How to download data from figshare? for details on downloading it).

Loading data

[1]:
import polpo.preprocessing.pd as ppd
from polpo.preprocessing.load import FigsharePregnancyDataLoader
[2]:
loader = FigsharePregnancyDataLoader(
    data_dir="~/.herbrain/data/pregnancy",
    remote_path="28Baby_Hormones.csv",
    use_cache=True,
)

# Chain the loader with a csv reader, then call the resulting pipeline.
data = (loader + ppd.CsvReader())()

data
[2]:
sessionID estro prog lh gestWeek stage EndoStatus trimester
0 ses-01 NaN NaN NaN -3.0 pre pilot1 pre
1 ses-02 3.42 0.840 NaN -0.5 pre pilot2 pre
2 ses-03 386.00 NaN NaN 1.0 pre IVF pre
3 ses-04 1238.00 NaN NaN 1.5 pre IVF pre
4 ses-05 1350.00 2.940 NaN 2.0 pre IVF first
5 ses-06 241.00 8.760 NaN 3.0 preg Pregnant first
6 ses-07 NaN NaN NaN 9.0 preg Pregnant first
7 ses-08 NaN NaN NaN 12.0 preg Pregnant first
8 ses-09 NaN NaN NaN 14.0 preg Pregnant second
9 ses-10 4700.00 53.900 1.45 15.0 preg Pregnant second
10 ses-11 4100.00 56.800 0.87 17.0 preg Pregnant second
11 ses-12 6190.00 70.600 0.93 19.0 preg Pregnant second
12 ses-13 9640.00 54.700 0.62 22.0 preg Pregnant second
13 ses-14 8800.00 64.100 0.73 24.0 preg Pregnant second
14 ses-15 8970.00 61.400 0.73 27.0 preg Pregnant third
15 ses-16 10200.00 74.200 0.69 29.0 preg Pregnant third
16 ses-17 9920.00 83.000 0.77 31.0 preg Pregnant third
17 ses-18 9860.00 95.300 0.83 33.0 preg Pregnant third
18 ses-19 12400.00 103.000 0.59 36.0 preg Pregnant third
19 ses-20 9.18 0.120 0.96 43.0 post Postpartum post
20 ses-21 20.70 0.043 4.01 46.0 post Postpartum post
21 ses-22 17.50 0.068 7.58 49.0 post Postpartum post
22 ses-23 11.50 0.042 4.67 51.0 post Postpartum post
23 ses-24 NaN NaN NaN 68.0 post Postpartum post
24 ses-25 NaN NaN NaN 93.0 post Postpartum post
25 ses-26 NaN NaN NaN 162.0 post Postpartum post
26 ses-27 NaN NaN NaN 162.0 post Postpartum post
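
For comparison, the reading step can be approximated with plain pandas. This is only a sketch, not polpo's implementation: it assumes that ppd.CsvReader yields something close to what pandas.read_csv returns, and that the file has the column layout shown above. A few rows copied from the output are inlined as a string so the example runs without downloading anything.

```python
import io

import pandas as pd

# A few rows copied from the output above, inlined so the sketch runs offline.
# The exact layout of the real file is an assumption.
csv_text = """sessionID,estro,prog,lh,gestWeek,stage,EndoStatus,trimester
ses-01,,,,-3.0,pre,pilot1,pre
ses-02,3.42,0.840,,-0.5,pre,pilot2,pre
ses-10,4700.00,53.900,1.45,15.0,preg,Pregnant,second
"""

# Empty fields are parsed as NaN, like the missing hormone values above.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

With a file on disk, the same call would take the path instead of the io.StringIO object.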

Manipulate data

NB: most operations are done in place. To leave the original data untouched, start the pipeline with ppd.DfCopy, or use a step's inplace parameter where it exists.
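
The difference between working in place and working on a copy can be illustrated with plain pandas, independently of polpo. A minimal sketch, assuming ppd.DfCopy plays a role similar to DataFrame.copy (this page does not confirm how it is implemented); the column values are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"estro": [3.42, None, 4700.0]})

# Work on a copy, so the in-place operation below cannot touch `original`.
cleaned = original.copy()
cleaned.dropna(inplace=True)

# `original` still has its row with the missing value; `cleaned` does not.
print(len(original), len(cleaned))
```

Without the copy, dropna(inplace=True) would remove the row from the original frame as well.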

[3]:
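cleaning_pipe = (
    ppd.DfCopy()  # work on a copy so that `data` is left untouched
    + ppd.UpdateColumnValues("sessionID", lambda entry: int(entry.split("-")[1]))  # "ses-10" -> 10
    + ppd.IndexSetter("sessionID", drop=True, inplace=True)  # index rows by session number
    + ppd.Drop(27, inplace=True)  # remove session 27
    + ppd.Dropna(inplace=True)  # keep only rows without missing values
)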

cleaning_pipe(data)
[3]:
estro prog lh gestWeek stage EndoStatus trimester
sessionID
10 4700.00 53.900 1.45 15.0 preg Pregnant second
11 4100.00 56.800 0.87 17.0 preg Pregnant second
12 6190.00 70.600 0.93 19.0 preg Pregnant second
13 9640.00 54.700 0.62 22.0 preg Pregnant second
14 8800.00 64.100 0.73 24.0 preg Pregnant second
15 8970.00 61.400 0.73 27.0 preg Pregnant third
16 10200.00 74.200 0.69 29.0 preg Pregnant third
17 9920.00 83.000 0.77 31.0 preg Pregnant third
18 9860.00 95.300 0.83 33.0 preg Pregnant third
19 12400.00 103.000 0.59 36.0 preg Pregnant third
20 9.18 0.120 0.96 43.0 post Postpartum post
21 20.70 0.043 4.01 46.0 post Postpartum post
22 17.50 0.068 7.58 49.0 post Postpartum post
23 11.50 0.042 4.67 51.0 post Postpartum post
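
For readers more familiar with pandas, the same cleaning steps can be sketched with plain DataFrame methods. This is an approximation under the assumption that each polpo step wraps the pandas method of similar name; it is not taken from the polpo source. The three-row frame is a reduced stand-in for the real data.

```python
import pandas as pd

# Reduced stand-in for the loaded data (values copied from the output above).
raw = pd.DataFrame(
    {
        "sessionID": ["ses-09", "ses-10", "ses-27"],
        "estro": [None, 4700.0, None],
        "gestWeek": [14.0, 15.0, 162.0],
    }
)

cleaned = raw.copy()  # leave the input untouched
# Turn "ses-10" into the integer 10.
cleaned["sessionID"] = cleaned["sessionID"].map(lambda entry: int(entry.split("-")[1]))
cleaned.set_index("sessionID", drop=True, inplace=True)
cleaned.drop(27, inplace=True)  # drop session 27 by index label
cleaned.dropna(inplace=True)  # keep only complete rows

print(cleaned)
```

Only session 10 survives in this reduced example, since it is the only row without missing values.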