How to load a CSV file?
This notebook shows how to load a CSV file as a pandas.DataFrame.
The data is stored on figshare (see How to download data from figshare? for details on downloading it).
Loading data
[1]:
import polpo.preprocessing.pd as ppd
from polpo.preprocessing.load import FigsharePregnancyDataLoader
[2]:
loader = FigsharePregnancyDataLoader(
data_dir="~/.herbrain/data/pregnancy",
remote_path="28Baby_Hormones.csv",
use_cache=True,
)
data = (loader + ppd.CsvReader())()
data
[2]:
| | sessionID | estro | prog | lh | gestWeek | stage | EndoStatus | trimester |
---|---|---|---|---|---|---|---|---|
0 | ses-01 | NaN | NaN | NaN | -3.0 | pre | pilot1 | pre |
1 | ses-02 | 3.42 | 0.840 | NaN | -0.5 | pre | pilot2 | pre |
2 | ses-03 | 386.00 | NaN | NaN | 1.0 | pre | IVF | pre |
3 | ses-04 | 1238.00 | NaN | NaN | 1.5 | pre | IVF | pre |
4 | ses-05 | 1350.00 | 2.940 | NaN | 2.0 | pre | IVF | first |
5 | ses-06 | 241.00 | 8.760 | NaN | 3.0 | preg | Pregnant | first |
6 | ses-07 | NaN | NaN | NaN | 9.0 | preg | Pregnant | first |
7 | ses-08 | NaN | NaN | NaN | 12.0 | preg | Pregnant | first |
8 | ses-09 | NaN | NaN | NaN | 14.0 | preg | Pregnant | second |
9 | ses-10 | 4700.00 | 53.900 | 1.45 | 15.0 | preg | Pregnant | second |
10 | ses-11 | 4100.00 | 56.800 | 0.87 | 17.0 | preg | Pregnant | second |
11 | ses-12 | 6190.00 | 70.600 | 0.93 | 19.0 | preg | Pregnant | second |
12 | ses-13 | 9640.00 | 54.700 | 0.62 | 22.0 | preg | Pregnant | second |
13 | ses-14 | 8800.00 | 64.100 | 0.73 | 24.0 | preg | Pregnant | second |
14 | ses-15 | 8970.00 | 61.400 | 0.73 | 27.0 | preg | Pregnant | third |
15 | ses-16 | 10200.00 | 74.200 | 0.69 | 29.0 | preg | Pregnant | third |
16 | ses-17 | 9920.00 | 83.000 | 0.77 | 31.0 | preg | Pregnant | third |
17 | ses-18 | 9860.00 | 95.300 | 0.83 | 33.0 | preg | Pregnant | third |
18 | ses-19 | 12400.00 | 103.000 | 0.59 | 36.0 | preg | Pregnant | third |
19 | ses-20 | 9.18 | 0.120 | 0.96 | 43.0 | post | Postpartum | post |
20 | ses-21 | 20.70 | 0.043 | 4.01 | 46.0 | post | Postpartum | post |
21 | ses-22 | 17.50 | 0.068 | 7.58 | 49.0 | post | Postpartum | post |
22 | ses-23 | 11.50 | 0.042 | 4.67 | 51.0 | post | Postpartum | post |
23 | ses-24 | NaN | NaN | NaN | 68.0 | post | Postpartum | post |
24 | ses-25 | NaN | NaN | NaN | 93.0 | post | Postpartum | post |
25 | ses-26 | NaN | NaN | NaN | 162.0 | post | Postpartum | post |
26 | ses-27 | NaN | NaN | NaN | 162.0 | post | Postpartum | post |
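For readers who want to see the reading step in isolation, here is a minimal plain-pandas sketch. It assumes that `ppd.CsvReader` behaves like `pandas.read_csv` applied to the downloaded file, which this notebook does not confirm. To stay self-contained it reads a few rows copied from the table above out of an in-memory string instead of downloading anything from figshare; `csv_text` and `sample` are illustrative names.

```python
import io

import pandas as pd

# A few rows copied from the table above, written as CSV text.
# Empty fields correspond to the missing values displayed as NaN.
csv_text = """sessionID,estro,prog,lh,gestWeek,stage,EndoStatus,trimester
ses-01,,,,-3.0,pre,pilot1,pre
ses-02,3.42,0.840,,-0.5,pre,pilot2,pre
ses-10,4700.00,53.900,1.45,15.0,preg,Pregnant,second
ses-20,9.18,0.120,0.96,43.0,post,Postpartum,post
"""

# read_csv accepts any file-like object, so a StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.isna().sum())  # number of missing values per column
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV on disk.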
Manipulate data
NB: most operations are done in place. To avoid modifying the original data, start the pipeline with ppd.DfCopy,
or use a step's inplace parameter if it exists.
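The difference between in-place and copying operations can be illustrated with plain pandas. This is only a sketch of the general behaviour: it assumes the polpo steps forward `inplace` to the underlying pandas method, which is not stated here. The frame and its values are made up for the illustration.

```python
import pandas as pd

# Toy frame with one missing value (illustrative data only).
original = pd.DataFrame({"estro": [3.42, None, 386.0]})

# Work on a copy so that `original` keeps all three rows.
working = original.copy()

# With inplace=True, pandas modifies `working` directly and returns None.
returned = working.dropna(inplace=True)

print(returned)
print(len(working), len(original))
```

Without the `copy()` call, `working` and `original` would be the same object and the row with the missing value would be dropped from both.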
[3]:
cleaning_pipe = (
ppd.DfCopy()
+ ppd.UpdateColumnValues("sessionID", lambda entry: int(entry.split("-")[1]))
+ ppd.IndexSetter("sessionID", drop=True, inplace=True)
+ ppd.Drop(27, inplace=True)
+ ppd.Dropna(inplace=True)
)
cleaning_pipe(data)
[3]:
sessionID | estro | prog | lh | gestWeek | stage | EndoStatus | trimester |
---|---|---|---|---|---|---|---|
10 | 4700.00 | 53.900 | 1.45 | 15.0 | preg | Pregnant | second |
11 | 4100.00 | 56.800 | 0.87 | 17.0 | preg | Pregnant | second |
12 | 6190.00 | 70.600 | 0.93 | 19.0 | preg | Pregnant | second |
13 | 9640.00 | 54.700 | 0.62 | 22.0 | preg | Pregnant | second |
14 | 8800.00 | 64.100 | 0.73 | 24.0 | preg | Pregnant | second |
15 | 8970.00 | 61.400 | 0.73 | 27.0 | preg | Pregnant | third |
16 | 10200.00 | 74.200 | 0.69 | 29.0 | preg | Pregnant | third |
17 | 9920.00 | 83.000 | 0.77 | 31.0 | preg | Pregnant | third |
18 | 9860.00 | 95.300 | 0.83 | 33.0 | preg | Pregnant | third |
19 | 12400.00 | 103.000 | 0.59 | 36.0 | preg | Pregnant | third |
20 | 9.18 | 0.120 | 0.96 | 43.0 | post | Postpartum | post |
21 | 20.70 | 0.043 | 4.01 | 46.0 | post | Postpartum | post |
22 | 17.50 | 0.068 | 7.58 | 49.0 | post | Postpartum | post |
23 | 11.50 | 0.042 | 4.67 | 51.0 | post | Postpartum | post |
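The cleaning pipeline above can be read as a sequence of ordinary pandas operations. The sketch below is an assumed plain-pandas equivalent, inferred from the step names and the output shown, not taken from the polpo implementation. It runs on five rows copied from the loaded table; `raw` and `cleaned` are illustrative names.

```python
import io

import pandas as pd

# Five rows copied from the loaded table, as CSV text.
csv_text = """sessionID,estro,prog,lh,gestWeek,stage,EndoStatus,trimester
ses-09,,,,14.0,preg,Pregnant,second
ses-10,4700.00,53.900,1.45,15.0,preg,Pregnant,second
ses-23,11.50,0.042,4.67,51.0,post,Postpartum,post
ses-26,,,,162.0,post,Postpartum,post
ses-27,,,,162.0,post,Postpartum,post
"""
raw = pd.read_csv(io.StringIO(csv_text))

# DfCopy: work on a copy so that `raw` is left untouched.
cleaned = raw.copy()
# UpdateColumnValues: turn "ses-10" into the integer 10.
cleaned["sessionID"] = cleaned["sessionID"].map(lambda entry: int(entry.split("-")[1]))
# IndexSetter: use the session number as the index and drop the column.
cleaned = cleaned.set_index("sessionID")
# Drop: remove the row labelled 27.
cleaned = cleaned.drop(27)
# Dropna: remove every row that still has a missing value.
cleaned = cleaned.dropna()

print(cleaned.index.tolist())
print(raw.shape)
```

Note that `Drop(27)` refers to the index label after the session IDs were converted to integers, not to a row position.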