When you use Pandas to process data, it is common to read data from Excel or CSV files, and sometimes you need to export the processed data to Excel or CSV files. Today, we will learn how to read and export common Pandas files.

Loading Excel files

In Pandas, Excel files are read with pd.read_excel(). Its full signature is:

pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, **kwds)

Where:

  • io: the Excel file to read; can be a file path, a URL, a file-like object, or an xlrd workbook
  • sheet_name: which sheet(s) to return; a string (sheet name), an integer (sheet index), a list of strings/integers (returns a dict of {name: DataFrame}), or None (returns a dict of all sheets)
  • header: row number(s) to use as the column header; int or list of ints
  • names: list of column names to use; array-like
  • index_col: column(s) to use as the row index; int or list of ints
  • usecols: which columns to parse; the default None parses all columns; an int parses up to that column; a list or a string such as "A:E" or "A,C,E:F" restricts parsing to those columns
  • squeeze: if the parsed data contains only one column, return a Series instead of the default DataFrame
  • dtype: type to apply to each column, e.g. {'a': np.float64, 'b': np.int32}
  • engine: must be set if io is not a buffer or a path; acceptable values are None or "xlrd"
  • converters: dict mapping column names or indices to conversion functions applied to each cell in that column
  • true_values: values to consider as True; not commonly used
  • false_values: values to consider as False; not commonly used
  • skiprows: rows to skip at the start; list-like
  • nrows: number of rows to parse
  • na_values: additional strings to recognize as N/A values
  • parse_dates: list of columns to parse into date format
  • date_parser: function that converts input strings to datetimes. Pandas expects the 'YYYY-MM-DD HH:MM:SS' format by default; if the data to be read is in a different format, you have to define a parser manually.
  • thousands: thousands separator to use when parsing numeric strings
  • comment: comment marker; content after it on a line is not parsed
  • skipfooter: number of rows to skip at the bottom of the file
  • convert_float: convert floats with zero decimal part (e.g. 1.0) to int
  • **kwds: additional keyword arguments passed through to the parser engine

This function returns a DataFrame, or a dict of DataFrames, which you can then work with using the usual DataFrame operations.

import pandas as pd

excel_path = 'example.xlsx'
df = pd.read_excel(excel_path, sheet_name=None)
print(df['sheet1'].example_column_name)

The most commonly used parameters of this function are io, sheet_name, header, names, and encoding. encoding is not listed in the signature above; it specifies the character encoding (any codec from the codecs package) used to read the data.

For example, the following error is reported when reading a file.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 36: invalid start byte

The solution is to set encoding="utf_8_sig", encoding="cp500", or encoding="gbk"; which one works depends on the file, so you may need to try several.
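That trial-and-error can be automated. Below is a minimal sketch (shown with read_csv, which accepts the encoding parameter in current pandas; the candidate list and the helper name are just illustrations):

```python
import pandas as pd

def read_with_fallback(path, encodings=("utf_8_sig", "gbk", "cp500")):
    """Try each candidate encoding in order until one decodes the file."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue  # this encoding failed; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

Put the most likely encoding first, since a permissive single-byte codec (like cp500) will "succeed" on almost any bytes and produce garbage if tried too early.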

Loading CSV files

In Pandas, CSV files are read with pd.read_csv(). Its full signature is:

pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)

Parameters that differ from read_excel:

  • filepath_or_buffer: this can accept a file name, or a URL, or an open file handle, or any other object that provides a read method.
  • sep and delimiter: these two parameters mean the same thing; delimiter is an alias for sep. Specifying \t (tab) reproduces the default behavior of read_table; regular expressions are supported for matching non-standard CSV files
  • mangle_dupe_cols: duplicate column names X are renamed X.1, X.2, … rather than overwritten
  • skipinitialspace: skip spaces after separators.
  • keep_default_na: whether to include the default NaN value when parsing data.
  • na_filter: detect missing value markers (empty strings and the values in na_values). For data without any NAs, passing na_filter=False can improve the performance of reading large files
  • verbose: print the number of NA values found in non-numeric columns
  • skip_blank_lines: if true, skips blank lines instead of interpreting them as NaN values.
  • infer_datetime_format: if True and parse_dates are enabled, Pandas will try to infer the format of the time string for the differences in the columns and switch to a faster analysis method if it can be inferred. In some cases, this can make parsing 5-10 times faster.
  • keep_date_col: whether to keep the original column after parsing out the date sequence
  • dayfirst: date format, DD/MM which comes first
  • iterator: return a TextFileReader object for iteration or for reading pieces with get_chunk()
  • chunksize: return the TextFileReader object for iteration. For more information about iterator and chunksize, see the IO Tools documentation.
  • compression: for real-time decompression of disk data
  • decimal: character to recognize as the decimal point (e.g. ',' for European data)
  • lineterminator: character to separate files into lines, only valid for C parsers
  • quotechar: character to indicate the beginning and end of a referenced item. Quoted items can include the separator character, which will be ignored.
  • quoting: controls the field referencing behavior
  • escapechar: character used to escape the delimiter or quotechar
  • dialect: If supplied, this parameter will override the values of the following parameters (default or not) : separator, double quote, escapechar, skipinitialspace, quotechar and quoting. If you need to override the values, a ParserWarning will be issued. See the csv dialect documentation for details.
  • tupleize_cols: leave a list of tuples on the columns (default is to convert to multiple indexes on the columns)
  • error_bad_lines: lines with too many fields (e.g. a CSV line with too many commas) raise an exception by default and no DataFrame is returned; if set to False, these "bad lines" are dropped from the returned DataFrame instead
  • warn_bad_lines: if error_bad_lines is False and warn_bad_lines is True, a warning is printed for each "bad line"
  • doublequote: Specifies a quotechar when the reference is not QUOTE_NONE, indicating whether two consecutive quotechar elements are to be interpreted as a single quotechar element.
  • delim_whitespace: sets whether to use whitespace for field separations
  • low_memory: process the file internally in chunks, which lowers memory use while parsing but may cause mixed type inference. To ensure no mixed types, set it to False, or specify the type with the dtype parameter. Note that the whole file is still read into a single DataFrame regardless; use chunksize or iterator to get the data back in chunks. (C parser only)
  • memory_map: if a file path is provided, map the file object directly into memory and access the data from there. This can improve performance because it removes I/O overhead
  • float_precision: Specifies the converter that the c engine applies to floating point values.

The method of loading CSV files is similar to loading Excel. Performance-wise, importing CSV is much faster than importing Excel, so CSV is recommended where possible. Be aware, however, that a malformed CSV file can cause rows to be parsed incorrectly.
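To make the read_csv parameters above concrete, here is a small sketch that combines a few of the common ones (the data and column names are invented for illustration; io.StringIO stands in for a file path):

```python
import io
import pandas as pd

csv_text = """date,value,label
2023-01-01,1.5,a
2023-01-02,NA,b
2023-01-03,3.0,c
"""

df = pd.read_csv(
    io.StringIO(csv_text),       # any file path or buffer works here
    parse_dates=["date"],        # parse this column into datetimes
    na_values=["NA"],            # treat "NA" as a missing value
    dtype={"label": "string"},   # force the label column's type
)
print(df.dtypes)
```

After loading, df["date"] is a datetime64 column and the "NA" cell has become NaN, so no manual post-processing is needed.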

Besides CSV and Excel, Pandas also supports the following import methods:

  • read_sql(query, connection_object): import data from SQL table/library
  • read_json(json_string): import data from JSON format string
  • pd.DataFrame(dict): build a DataFrame from a dictionary object; keys are the column names, values are the data
  • read_html(url): parse the URL, string or HTML file, extract the tables table.
  • read_clipboard(): read content from the clipboard and pass it to read_table()
  • read_table(filename): import data from a delimited text file
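Two of these can be demonstrated without any external file. A minimal sketch (the names and numbers are made up; newer pandas prefers a buffer over a raw string for read_json, hence the StringIO wrapper):

```python
import io
import pandas as pd

# Build a DataFrame from a dict: keys become the column names
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

# Round-trip through a JSON string
json_str = df.to_json()
df2 = pd.read_json(io.StringIO(json_str))
print(df2)
```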

pd.read_sql(query, connection_object) deserves particular attention here; the plan is to cover it in a later article.

Export to Excel file

The method of exporting to Excel is very simple.

import pandas as pd
excel_path = 'example.xlsx'
df = pd.read_excel(excel_path)  # sheet_name defaults to 0, so a single DataFrame is returned
df.to_excel('output.xlsx')

The specific export method also has numerous parameters.

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None)

Parameter meanings:

  • excel_writer: the target excel file to be written, can be a file path, ExcelWriter object;
  • sheet_name: the name of the sheet being written, string type, default is 'Sheet1';
  • na_rep: missing value representation, string type;
  • float_format: the format of the floating point number
  • columns: the columns to be written
  • header: whether to write out the column names, boolean or list of string type, default is True;
  • index: whether to write the row number, boolean type, default is True;
  • index_label: index label
  • startrow: upper-left cell row at which to start writing the data frame
  • startcol: upper-left cell column at which to start writing the data frame
  • engine: engine to write with, can be: io.excel.xlsx.writer, io.excel.xls.writer, io.excel.xlsm.writer
  • merge_cells: write MultiIndex columns and hierarchical rows as merged cells
  • encoding: specify write encoding, string type.
  • inf_rep: specifies the representation of mathematical symbol infinity in Excel
  • verbose: display more information in the error logs
  • freeze_panes: freeze panes
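A short sketch combining several of these parameters (it needs an Excel engine such as openpyxl installed; the in-memory buffer and the column names are invented for illustration, and a path like 'output.xlsx' works the same way):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.14159, None]})

buf = io.BytesIO()             # in-memory target; a file path also works
df.to_excel(
    buf,
    sheet_name="results",
    na_rep="N/A",              # how missing cells are written
    float_format="%.2f",       # round floats to two decimals
    index=False,               # do not write the row index
)

# Read it back to check what was written
buf.seek(0)
out = pd.read_excel(buf, sheet_name="results")
print(out)
```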

Export to CSV file

Exporting to CSV works much like exporting to Excel; the method's parameters are:

DataFrame.to_csv(path_or_buf=None, sep=', ', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')

The specific parameter meanings are (parameters already covered under "Loading CSV files" and "Export to Excel file" are not repeated):

  • path_or_buf: the file name or file path to be written
  • mode: mode of writing file, default is w, change to a for append.
  • date_format: format string applied to datetime objects when writing
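A minimal sketch of these parameters in use (the DataFrame contents are invented; calling to_csv without a path returns the CSV text as a string, which makes the result easy to inspect):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "value": [1.0, None],
})

csv_text = df.to_csv(
    index=False,              # omit the row index
    na_rep="NULL",            # write missing values as NULL
    date_format="%Y/%m/%d",   # custom date formatting
)
print(csv_text)
```

To append to an existing file instead of overwriting it, pass a path together with mode='a'.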

Pandas also supports the following export methods.

  • to_sql(table_name, connection_object): export data to SQL table
  • to_json(filename): export data to a text file in JSON format
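As a quick taste of to_sql, the following sketch writes a DataFrame into an in-memory SQLite database and reads the rows back (the table name and data are made up; a real project would more likely use a SQLAlchemy connection):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA"], "pop": [8.4, 3.9]})

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
df.to_sql("cities", conn, index=False)    # create the table and insert the rows

rows = conn.execute("SELECT city, pop FROM cities").fetchall()
print(rows)
```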