Using Pandas read_csv With Missing Data

August 30, 2022

I am attempting to read a CSV file in which some rows may be missing chunks of data. This seems to cause a problem with the pandas read_csv function when a dtype is specified.

Solution 1:

Try this:

import pandas as pd
import io

datfile = io.StringIO(u"12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")

names  = ['id', 'flag', 'number', 'data', 'data2']
# np.str and np.float were removed in NumPy 1.24; the builtins are equivalent
dtypes = [str, str, str, float, float]
dform  = {name: dtypes[ind] for ind, name in enumerate(names)}

colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}

df     = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None, na_values=' ')
df.columns = names

Edit: To convert the dtypes after import:

df["number"] = df["number"].astype('int')
df["data"]   = df["data"].astype('float')

These casts only succeed once the blanks are gone, because your data is a mix of blank strings and numbers.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
id        2 non-null object
flag      2 non-null object
number    2 non-null object
data      2 non-null object
data2     2 non-null float64
dtypes: float64(1), object(4)
memory usage: 152.0+ bytes

If you look at data, it was requested as float but was converted to object because of the blank. data2 stays float only until it contains a blank, at which point it will turn into object as well.
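One way around the mixed columns, offered as a sketch rather than the definitive fix: read every field as a string, strip the padding, and convert afterwards with pd.to_numeric(errors='coerce'), which turns anything it cannot parse (including the blanks) into NaN. The nullable 'Int64' dtype for number is an assumption about what you want, and it needs a reasonably recent pandas.

```python
import io
import pandas as pd

raw = "12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3"
names = ['id', 'flag', 'number', 'data', 'data2']

# Read every field as text so that blanks cannot break the parse.
df = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=names,
                 dtype=str, keep_default_na=False)

# Remove the padding around each field.
df = df.apply(lambda col: col.str.strip())

# Blanks are now empty strings; errors='coerce' maps them to missing values.
df['number'] = pd.to_numeric(df['number'], errors='coerce').astype('Int64')
for col in ['data', 'data2']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
```

The number column then holds integers alongside a missing-value marker instead of falling back to object.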

Solution 2:

So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas behaves this way to begin with. Unfortunately I didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up changing lines 1087-1096 of the file parser.pyx to

        na_count_old = na_count
        print(col_res)
        for ind, row in enumerate(col_res):
            k = kh_get_str(na_hashset, row.strip().encode())
            if k != na_hashset.n_buckets:
                col_res[ind] = np.nan
                na_count += 1
            else:
                col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

        if na_count_old==na_count:
            # float -> int conversions can fail the above
            # even with no nans
            col_res_orig = col_res
            col_res = col_res.astype(col_dtype)
            if (col_res != col_res_orig).any():
                raise ValueError("cannot safely convert passed user dtype of "
                                 "{col_dtype} for {col_res} dtyped data in "
                                 "column {column}".format(col_dtype=col_dtype,
                                                          col_res=col_res_orig.dtype.name,
                                                          column=i))

This essentially goes through each element of a column and checks whether it is contained in the NA list (note that each element has to be stripped first, so that fields made up of several spaces are recognized as being in the NA list). If it is, that element is set to a double np.nan. If it is not, the element is cast to the dtype originally specified for that column (which means the column will hold multiple dtypes).

While this isn't a perfect fix (and is likely slow) it works for my needs and maybe someone else who has a similar problem will find it useful.
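For anyone who would rather not patch pandas, the same per-element idea (strip, check against an NA list, otherwise cast) can be approximated with converters. This is only a sketch: na_aware and NA_STRINGS are names made up for the example, and the NA list is assumed to contain just the empty string. As far as I can tell, pandas still infers the column dtype from the converted values, so an int/NaN mix usually comes back as float64 rather than as the mixed column the patch produces.

```python
import io
import numpy as np
import pandas as pd

NA_STRINGS = frozenset({''})  # assumed NA list for this example; extend as needed

def na_aware(cast):
    """Build a converter: strip the field, return NaN if it is in the NA list, else cast it."""
    def convert(field):
        stripped = field.strip()
        return np.nan if stripped in NA_STRINGS else cast(stripped)
    return convert

raw = "12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3"
names = ['id', 'flag', 'number', 'data', 'data2']

# Each converter receives the raw string of one field.
df = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=names,
                 converters={'id': str.strip, 'flag': str.strip,
                             'number': na_aware(int),
                             'data': na_aware(float),
                             'data2': na_aware(float)})
```

Calling a Python function per field is likely slow on large files, much like the patched loop above.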
