Python Pandas DataFrame basics

Pandas Dataframe

Pandas DataFrame

Pandas DataFrame is a 2-dimensional, titled data structure with columns of potentially different types. You can think of it like a spreadsheet or database table, or a dict of Series objects. It is  the most commonly used pandas object. Pandas DataFrame accepts many different kinds of input like Dict of one-dimensional ndarrays, lists, dicts, or Series, two-dimensional ndarrays, structured or record ndarray, a dictionary of Series, or another DataFrame.

df = pd.DataFrame(data=None, index=None, columns=None)

Here, df is the Pandas DataFrame and data can be NumPy ndarray, dict, or DataFrame. Along with the data, you can optionally pass an index (row labels) and columns (column labels) attributes as arguments. If you pass an index and/or columns, you are warranting the index and/or columns of the resulting Pandas DataFrame. Both index and columns will default to range(n) where n is the length of data, if they are not provided. When the data is a dictionary and columns are not specified, then the Pandas DataFrame column labels will be the dictionary’s keys.

Create Pandas DataFrame from Dictionary of Series/Dictionaries

You need to import pandas library. Create a dictionary whose values are a series of a one-dimensional arrays. You can create a Pandas DataFrame from a dictionary of series. Pass dict_series dictionary as argument to DataFrame() class which returns a Pandas DataFrame object. Index labels are passed as a list to index attribute. If the number of labels specified in the various series are not the same, then the resulting index will be the union of all the index labels of various series. In DataFrame df under column “one“, there is no data element associated with index label “d“, so NaN will be inserted at that position. A number of rows and columns in a Pandas DataFrame is obtained using shape attribute. Get the index labels for the Pandas DataFrame using index attribute. With columns attribute, you get all the columns of the DataFrame. Get the columns of DataFrame as a list by passing the columns attribute as an argument to the list() function. You can create a Pandas DataFrame from a dictionary without using index and columns attribute. For dict_df  DataFrame , a and b are columns and index labels are integers ranging from zero to two .

Create DataFrame from ndarrays/lists/list of dictionaries

The pandas library is built on top of NumPy. Here, dict_ndarrays  is a dictionary of ndarrays from which you can create a Pandas DataFrame. Also, nested lists can be used to create a Pandas DataFrame. If no index and columns are specified, then both index and columns will have integer labels. Keys are considered as column labels when a Pandas DataFrame is created using dictionaries. The Pandas DataFrame columns will be preserved in the same order as specified by dictionary keys. In, index labels are specified for a DataFrame created from the dictionary. The Pandas DataFrame can also be created from a list of dictionaries. Since DataFrame columns will be a union of all the keys in the list of dictionaries, elements for missing columns will be NaN.

Pandas DataFrame Column Selection, Addition and Deletion

Create Pandas DataFrame df  from la_liga dictionary. You can select a particular column in a DataFrame by specifying the column name within quotes inside a bracket of a Pandas DataFrame. You can add a new column to the DataFrame by specifying the column label within the bracket of DataFrame and assign data elements to it. Grab multiple columns from a Pandas DataFrame by passing a list of columns. You can also create a new column by making use of the data elements found in existing columns. Column “Points” is inserted to the Pandas DataFrame df after multiplying all the data elements in column “Won” by 2. You can perform basic arithmetic operations on Pandas DataFrame columns . When inserting a scalar value, it will congenitally be propagated to fill the column. Columns can be deleted or popped . By default, columns get inserted at the end. The insert function is available to insert at a dainty location in the columns . You can rename the column label using the rename() method. The columns attribute has to be passed to the rename() method and assign it with a dictionary where the old column label will be key and new column label will be a value of . All the above operations have a direct impact on the Pandas DataFrame.

Displaying Data in Pandas DataFrame

DataFrame head(n)  method returns first n rows and tail(n)  method returns last n rows. You can find unique data elements in a column by chaining unique() method with a Pandas DataFrame column using dot notation. The unique()  method returns a one-dimensional array-like object, which can be converted to a list using tolist() method. The transpose() method flips the DataFrame over its main diagonal by writing rows as columns and vice versa. The syntax for sort_values() method is,

df.sort_values(by, axis=0, ascending=True)

where the by parameter can be a string, list of strings, index label, column label, list of index labels, or list of column labels to sort by. If the value of the axis is 0 then by may contain column labels. If the value of the axis is 1, then by may contain index labels. By default, the value of the axis parameter is 0. The default value of ascending parameter is True, if so then the data elements will be sorted in ascending order. A False value leads to sorting the data elements in descending order . By default, the sort_index() method, performs sorting on row labels in ascending order and returns a copy of the Pandas DataFrame. If the ascending parameter is set to Boolean False, then the sort_index() method performs sorting in descending order. The value_counts() method when chained with a Pandas DataFrame, returns a Series object containing counts of unique values . The resulting items will be in descending order so that the first element is the most frequently-occurring element. The NA values are excluded by default. The index attribute returns the index or row labels of the Series . The values attribute returns a NumPy representation of the Series .

Using Pandas DataFrame assign() method

DataFrame has an assign() method that allows you to easily create new columns that are assuredly derived from existing columns . The assign() method always returns a copy of the data, leaving the original Pandas DataFrame untouched.

Pandas DataFrame Indexing and Selecting Data

The Python and NumPy indexing operators [] and dot operator . provide quick and easy access to select a subset of data elements in a pandas Pandas DataFrame across a wide range of use cases. However, since the type of the data to be  penetrated isn’t known in advance, directly using standard operators has some optimization limits. For production code, it’s highly recommended that you take advantage of the optimized pandas data access methods, like .loc[] and .iloc[], which are used to retrieve rows. Note that .loc[ ] and .iloc[ ] methods are followed by square brackets [ ], not parentheses () and are called as indexers. The .loc[] method is primarily label based, but may also be used with a Boolean array. The .loc[] method will raise KeyError when the items are not found. Inputs accepted by

.loc[] method are a single label, e.g. 5 or ‘a’ (note that 5 is interpreted as a label of the index/row; this use is not an integer position along the index), a list or array of labels [‘a’, ‘b’, ‘c’], a slice items with labels ‘a’:’f’ (note that contrary to usual Python slices, both the start and the stop are included, when present in the index!) and Boolean array. The .iloc[] method is primarily an integer position based (from 0 to length-1 of the axis), but may also be used with a Boolean array. The .iloc[] method will raise an IndexError if a requested indexer is out-of-bounds, except in the case of slice indexers, which allow out-ofbounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs for .iloc[] method are an integer, such as 5, a list or array of integers [4, 3, 0], a slice object with ints 1:7, and a Boolean array. For example,

For .loc[row_label_indexing, col_label_indexing] and .iloc[row_integer_indexing, col_integer_indexing] methods, a single argument always refers to selecting data elements from row indices in the Pandas DataFrame and not the column indices. When col_label_indexing or col_integer_indexing is absent, it means all the columns for that particular row will be selected. For example, .loc[‘a’] is equivalent to .loc[‘a’,:]. This example applies to iloc as well. With df.loc[indexer] you know automatically that df.loc[] is selecting rows. In difference, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing

details about indexer and df. In, select the first row labeled as row_1 and all the columns of that row. Line selects data elements present in the second row labeled as row_2 and third column labeled as col_3. Selects values present in the first row, row_1 and second row, row_2 along with their corresponding columns. In, you slice via labels and select all the rows under column 2 and 3.

You can grab data based on position instead of labels using .iloc method. The .iloc[] method provides integer-based indexing. The semantics follow Python and NumPy slicing closely. These are zero-based indexing. When slicing, the start bound is carried, while the upper bound is excluded. In, select all the data elements from the second row along the entire columns. Even though we have labeled these rows and columns, still their integer indices range from 0 to n – 1, where n is the length of the data. Slicing returns a subset of data elements present in DataFrame along with their corresponding labels. In _, even though the row index is out of range, still the existing rows will be selected and out-of-range indexes are handled gracefully. All data elements starting from first to the third row along their entire column are selected . The entire Pandas DataFrame is selected

in . Rows from position three onwards and columns from position three onwards are selected . All the rows in the second column are selected . An important feature of pandas is contingent selection using bracket notation, very similar to numpy. Data elements greater than 0.2 in Pandas DataFrame are displayed while the lower values are treated as NaN . Note, none of the above operations change the original data elements of DataFrame.

Group By: split-apply-combine

Here, “group by” refers to a process involving one or more of the following steps:

  • criteria the data are split into group on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.
  1. Out of these, the split step is the most straightforward. In fact, in many postures, we may wish to split the data set into groups and do something with those groups.
  2. In the apply step, we might wish to do one of the following:

Aggregation: compute an abstract statistic (or statistics) for each group. For example,

Compute group sums or means.

Compute group sizes/counts.

Transformation: perform some group-specific computations and return a likeindexed object. For example,

Standardize data (zscore) within a group.

Filling NAs within groups with a value derived from each group.

Filtration: discard some groups, conforming to a group-wise computation that evaluates True or False. For example, Discard data that belong to groups with only a few members. sum or mean are filter out data base on group.

  1. Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it does not fit into either of the above two categories.

For example,

In the above code, cars_df, is the DataFrame on which the groupby() method will be applied. The groupby() method allows you to group rows of data together based on a column name and call aggregate functions on them. For instance, let’s group based on “Company” column using the groupby() method. This will return a Pandas DataFrameGroupBy object upon which you can call aggregate methods. If you need to count only one column then specify the name of the column within brackets as shown in for which you can get values and index labels as a list by chaining tolist() method.

Aggregate Functions:

Aggregate Functions and Their Description:


Compute mean of groups


Compute sum of group values


Compute group sizes


Compute count of group


Standard Deviation of groups


Generate Descriptive Statistics


Compute the minimum of group values


Compute maximum of group values

The group’s attribute is a dictionary whose keys are the computed unique groups and corresponding values are the index labels belonging to each group. Assign the DataFrameGroupBy object returned by groupby() method to a grp_by_company variable. With a grp_by_company object, you can iterate through the grouped data by specifying two iterating variables. Here label variable returns the data elements of the grouped column company and group variable returns the grouped data. Concatenate, Append and Merge The pandas library provides various facilities for easily combining/concatenating together Series as well as Pandas DataFrame objects. The pandas library also has support for full-featured, high-performance in-memory merge operations, also called join operations. The pandas library provides a single function, merge(), as the entry point for all standard merge operations between different Pandas DataFrame objects.

Related Article:

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button