How to load a CSV file?#
This notebook shows how to load a CSV file as a pandas.DataFrame.
The data is stored on figshare (see How to download data from figshare? for details on how to download it).
Loading data#
[1]:
import polpo.preprocessing.pd as ppd
from polpo.preprocessing.load.pregnancy.pilot import FigsharePregnancyDataLoader
[2]:
loader = FigsharePregnancyDataLoader(
    data_dir="~/.herbrain/data/pregnancy",
    remote_path="28Baby_Hormones.csv",
    use_cache=True,
)
# Chain the loader with a CSV reader, then call the pipeline to get the DataFrame.
data = (loader + ppd.CsvReader())()
data
[2]:
| | sessionID | estro | prog | lh | gestWeek | stage | EndoStatus | trimester |
|---|---|---|---|---|---|---|---|---|
| 0 | ses-01 | NaN | NaN | NaN | -3.0 | pre | pilot1 | pre |
| 1 | ses-02 | 3.42 | 0.840 | NaN | -0.5 | pre | pilot2 | pre |
| 2 | ses-03 | 386.00 | NaN | NaN | 1.0 | pre | IVF | pre |
| 3 | ses-04 | 1238.00 | NaN | NaN | 1.5 | pre | IVF | pre |
| 4 | ses-05 | 1350.00 | 2.940 | NaN | 2.0 | pre | IVF | first |
| 5 | ses-06 | 241.00 | 8.760 | NaN | 3.0 | preg | Pregnant | first |
| 6 | ses-07 | NaN | NaN | NaN | 9.0 | preg | Pregnant | first |
| 7 | ses-08 | NaN | NaN | NaN | 12.0 | preg | Pregnant | first |
| 8 | ses-09 | NaN | NaN | NaN | 14.0 | preg | Pregnant | second |
| 9 | ses-10 | 4700.00 | 53.900 | 1.45 | 15.0 | preg | Pregnant | second |
| 10 | ses-11 | 4100.00 | 56.800 | 0.87 | 17.0 | preg | Pregnant | second |
| 11 | ses-12 | 6190.00 | 70.600 | 0.93 | 19.0 | preg | Pregnant | second |
| 12 | ses-13 | 9640.00 | 54.700 | 0.62 | 22.0 | preg | Pregnant | second |
| 13 | ses-14 | 8800.00 | 64.100 | 0.73 | 24.0 | preg | Pregnant | second |
| 14 | ses-15 | 8970.00 | 61.400 | 0.73 | 27.0 | preg | Pregnant | third |
| 15 | ses-16 | 10200.00 | 74.200 | 0.69 | 29.0 | preg | Pregnant | third |
| 16 | ses-17 | 9920.00 | 83.000 | 0.77 | 31.0 | preg | Pregnant | third |
| 17 | ses-18 | 9860.00 | 95.300 | 0.83 | 33.0 | preg | Pregnant | third |
| 18 | ses-19 | 12400.00 | 103.000 | 0.59 | 36.0 | preg | Pregnant | third |
| 19 | ses-20 | 9.18 | 0.120 | 0.96 | 43.0 | post | Postpartum | post |
| 20 | ses-21 | 20.70 | 0.043 | 4.01 | 46.0 | post | Postpartum | post |
| 21 | ses-22 | 17.50 | 0.068 | 7.58 | 49.0 | post | Postpartum | post |
| 22 | ses-23 | 11.50 | 0.042 | 4.67 | 51.0 | post | Postpartum | post |
| 23 | ses-24 | NaN | NaN | NaN | 68.0 | post | Postpartum | post |
| 24 | ses-25 | NaN | NaN | NaN | 93.0 | post | Postpartum | post |
| 25 | ses-26 | NaN | NaN | NaN | 162.0 | post | Postpartum | post |
| 26 | ses-27 | NaN | NaN | NaN | 162.0 | post | Postpartum | post |
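For readers who want to see what the reader step amounts to, here is a minimal sketch in plain pandas. It assumes that `ppd.CsvReader` behaves roughly like `pandas.read_csv` on the downloaded file, which is not confirmed here. The CSV text is a hand-copied excerpt of three rows from the table above, and its column layout is an assumption about the real file.

```python
import io

import pandas as pd

# Excerpt of three rows copied from the table above (not the full file).
# Empty fields stand in for the missing values displayed as NaN.
csv_text = """sessionID,estro,prog,lh,gestWeek,stage,EndoStatus,trimester
ses-01,,,,-3.0,pre,pilot1,pre
ses-02,3.42,0.840,,-0.5,pre,pilot2,pre
ses-10,4700.00,53.900,1.45,15.0,preg,Pregnant,second
"""

# Assumption: the reader step is roughly equivalent to pandas.read_csv.
excerpt = pd.read_csv(io.StringIO(csv_text))
print(excerpt)
```

Empty fields are parsed as NaN, which is why the hormone columns of the early sessions show NaN in the full table.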
Manipulate data#
NB: most operations are done in place. Use ppd.DfCopy to avoid modifying the original DataFrame, or set the inplace parameter where it exists.
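The in-place behaviour noted above can be illustrated with plain pandas. This is only a sketch: it uses `DataFrame.copy` and `DataFrame.set_index` directly and does not claim to reproduce how the polpo steps are implemented. The two rows are taken from the table above.

```python
import pandas as pd

original = pd.DataFrame(
    {"sessionID": ["ses-01", "ses-02"], "gestWeek": [-3.0, -0.5]}
)

# Work on a copy so that the original frame is left untouched.
working = original.copy()
working.set_index("sessionID", inplace=True)  # mutates `working` only

print(list(original.columns))  # sessionID is still a column here
print(list(working.columns))   # sessionID has moved to the index here
```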
[3]:
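cleaning_pipe = (
    ppd.DfCopy()  # work on a copy so that `data` is left untouched
    # "ses-01" -> 1: keep only the integer part of the session label
    + ppd.UpdateColumnValues("sessionID", lambda entry: int(entry.split("-")[1]))
    + ppd.IndexSetter("sessionID", drop=True, inplace=True)
    + ppd.Drop(27, inplace=True)  # drop session 27
    + ppd.Dropna(inplace=True)  # drop sessions with missing values
)
cleaning_pipe(data)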
[3]:
| sessionID | estro | prog | lh | gestWeek | stage | EndoStatus | trimester |
|---|---|---|---|---|---|---|---|
| 10 | 4700.00 | 53.900 | 1.45 | 15.0 | preg | Pregnant | second |
| 11 | 4100.00 | 56.800 | 0.87 | 17.0 | preg | Pregnant | second |
| 12 | 6190.00 | 70.600 | 0.93 | 19.0 | preg | Pregnant | second |
| 13 | 9640.00 | 54.700 | 0.62 | 22.0 | preg | Pregnant | second |
| 14 | 8800.00 | 64.100 | 0.73 | 24.0 | preg | Pregnant | second |
| 15 | 8970.00 | 61.400 | 0.73 | 27.0 | preg | Pregnant | third |
| 16 | 10200.00 | 74.200 | 0.69 | 29.0 | preg | Pregnant | third |
| 17 | 9920.00 | 83.000 | 0.77 | 31.0 | preg | Pregnant | third |
| 18 | 9860.00 | 95.300 | 0.83 | 33.0 | preg | Pregnant | third |
| 19 | 12400.00 | 103.000 | 0.59 | 36.0 | preg | Pregnant | third |
| 20 | 9.18 | 0.120 | 0.96 | 43.0 | post | Postpartum | post |
| 21 | 20.70 | 0.043 | 4.01 | 46.0 | post | Postpartum | post |
| 22 | 17.50 | 0.068 | 7.58 | 49.0 | post | Postpartum | post |
| 23 | 11.50 | 0.042 | 4.67 | 51.0 | post | Postpartum | post |
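As a rough cross-check, the same cleaning steps can be written in plain pandas. The sketch below assumes that each polpo step wraps the similarly named pandas operation (`copy`, `set_index`, `drop`, `dropna`), which is an assumption rather than a statement about the library. The input is a four-row, five-column excerpt of the table shown earlier.

```python
import io

import pandas as pd

# Excerpt copied from the loaded table (subset of rows and columns).
csv_text = """sessionID,estro,prog,lh,gestWeek
ses-09,,,,14.0
ses-10,4700.00,53.900,1.45,15.0
ses-23,11.50,0.042,4.67,51.0
ses-27,,,,162.0
"""
data = pd.read_csv(io.StringIO(csv_text))

cleaned = data.copy()  # keep `data` unchanged
# "ses-10" -> 10
cleaned["sessionID"] = cleaned["sessionID"].map(
    lambda entry: int(entry.split("-")[1])
)
cleaned.set_index("sessionID", drop=True, inplace=True)
cleaned.drop(27, inplace=True)  # remove the row labelled 27
cleaned.dropna(inplace=True)    # remove rows with any missing value
print(cleaned)
```

Only sessions 10 and 23 survive in this excerpt, matching the rows kept for them in the output above, while `data` still holds all four original rows.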