Using Pandas Read_csv With Missing Data
Solution 1:
Try this:
import pandas as pd
import numpy as np
import io
datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']
# np.str and np.float were removed in NumPy 1.24; use the builtin str and np.float64
dtypes = [str, str, str, np.float64, np.float64]
dform = {name: dtypes[ind] for ind, name in enumerate(names)}
colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}
df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None, na_values=' ')
df.columns = names
Edit: To convert dtypes after the import:
# astype('int') raises if the column still holds NaN; astype('float') tolerates it
df["number"] = df["number"].astype('int')
df["data"] = df["data"].astype('float')
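The post-import conversion can be sketched a little more defensively. This is only a sketch, not the answerer's code: it assumes every field is first read as a plain string (dtype=str, keep_default_na=False), and it swaps in pd.to_numeric plus the nullable "Int64" dtype so that blanks can coexist with integers.

```python
import io
import pandas as pd

raw = "12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3"
names = ["id", "flag", "number", "data", "data2"]

# Read every field as text so nothing is inferred or dropped at parse time.
df = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=names,
                 dtype=str, keep_default_na=False)

# Trim the padding around each field; blank fields become empty strings.
df = df.apply(lambda col: col.str.strip())

# errors="coerce" turns anything unparseable (here: empty strings) into NaN.
df["number"] = pd.to_numeric(df["number"], errors="coerce").astype("Int64")
df["data"] = pd.to_numeric(df["data"], errors="coerce")
df["data2"] = pd.to_numeric(df["data2"], errors="coerce")

print(df.dtypes)
```

The capitalised "Int64" is the pandas nullable integer type, which stores a missing marker alongside integers; the plain 'int' cast cannot do that.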
Your data mixes blank strings with numbers.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
id        2 non-null object
flag      2 non-null object
number    2 non-null object
data      2 non-null object
data2     2 non-null float64
dtypes: float64(1), object(4)
memory usage: 152.0+ bytes
Note that data was requested as np.float but ended up as object because it contains a blank. data2 stays float64 only as long as it has no blank; as soon as it does, it turns into object as well.
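The effect of a blank on dtype inference can be reproduced with the same two rows. The sketch below is an illustration rather than part of the original answer: it compares a plain read with one that passes skipinitialspace=True, on the assumption that dropping the spaces after each delimiter leaves blank fields empty so pandas treats them as missing.

```python
import io
import pandas as pd

raw = "12 23 43| | 37| 12.23| 71.3\n12 23 55|X| | | 72.3"
names = ["id", "flag", "number", "data", "data2"]

# Plain read: the " " fields are kept as text, so those columns are not numeric.
plain = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=names)

# skipinitialspace=True drops spaces after each delimiter, so a blank field
# becomes an empty string, which read_csv treats as missing by default.
trimmed = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=names,
                      skipinitialspace=True)

print(plain.dtypes)
print(trimmed.dtypes)
```

With the blanks read as NaN, number and data can be parsed as floats, although number still cannot be a plain integer column.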
Solution 2:
So, as Merlin pointed out, the main problem is that NaNs can't be stored as ints, which is probably why pandas behaves this way in the first place. Unfortunately I didn't have a choice, so I had to modify the pandas source code myself. I ended up changing lines 1087-1096 of the file parser.pyx to:
na_count_old = na_count
print(col_res)
for ind, row in enumerate(col_res):
    k = kh_get_str(na_hashset, row.strip().encode())
    if k != na_hashset.n_buckets:
        col_res[ind] = np.nan
        na_count += 1
    else:
        col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)
if na_count_old==na_count:
    # float -> int conversions can fail the above
    # even with no nans
    col_res_orig = col_res
    col_res = col_res.astype(col_dtype)
    if (col_res != col_res_orig).any():
        raise ValueError("cannot safely convert passed user dtype of "
                         "{col_dtype} for {col_res} dtyped data in "
                         "column {column}".format(col_dtype=col_dtype,
                                                  col_res=col_res_orig.dtype.name,
                                                  column=i))
This essentially goes through each element of a column and checks whether it is in the NA list (the element has to be stripped first so that runs of multiple spaces are recognized as being in the NA list). If it is, the element is set to np.nan, which is a double. If it is not, the element is cast to the dtype originally specified for that column, which means the column ends up holding multiple dtypes.
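The same element-wise idea can be tried in plain Python without patching pandas. The helper below is hypothetical (convert_column and its arguments are names made up for this sketch, not pandas internals), and unlike the patch it strips the value before casting as well as before the NA lookup.

```python
import numpy as np

def convert_column(values, na_values, col_dtype):
    """Hypothetical helper: NaN for NA entries, col_dtype for everything else."""
    out = np.empty(len(values), dtype=object)  # object array can mix int and float
    for ind, raw in enumerate(values):
        stripped = raw.strip()  # so that "   " matches the NA entry ""
        if stripped in na_values:
            out[ind] = np.nan
        else:
            # Cast the single value to the requested dtype, then unwrap it
            # into a plain Python scalar.
            out[ind] = np.array(stripped).astype(col_dtype).item()
    return out

number = convert_column([" 37", "   "], {""}, np.int64)
data = convert_column([" 12.23", " "], {""}, np.float64)
print(number, data)
```

The result is an object array in which integers sit next to a float NaN, which is the "multiple dtypes" trade-off described above.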
While this isn't a perfect fix (and is likely slow) it works for my needs and maybe someone else who has a similar problem will find it useful.