# Python Pandas DataFrame basics

**Pandas DataFrame**

Pandas DataFrame is a 2-dimensional, titled data structure with columns of potentially different types. You can think of it like a spreadsheet or database table, or a dict of Series objects. It is the most commonly used pandas object. Pandas DataFrame accepts many different kinds of input like Dict of one-dimensional ndarrays, lists, dicts, or Series, two-dimensional ndarrays, structured or record ndarray, a dictionary of Series, or another DataFrame.

df = pd.DataFrame(data=None, index=None, columns=None)

Here, df is the Pandas DataFrame and data can be NumPy ndarray, dict, or DataFrame. Along with the data, you can optionally pass an index (row labels) and columns (column labels) attributes as arguments. If you pass an index and/or columns, you are warranting the index and/or columns of the resulting Pandas DataFrame. Both index and columns will default to range(n) where n is the length of data, if they are not provided. When the data is a dictionary and columns are not specified, then the Pandas DataFrame column labels will be the dictionary’s keys.

*Create Pandas DataFrame from Dictionary of Series/Dictionaries*

*Create Pandas DataFrame from Dictionary of Series/Dictionaries*

>>> import pandas as pd >>> dict_series = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), ... 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} >>> df = pd.DataFrame(dict_series) >>> df one two a 1.0 1.0 b 2.0 2.0 c 3.0 3.0 d NaN 4.0 index columns >>> df.shape (4, 2) >>> df.index Index(['a', 'b', 'c', 'd'], dtype='object') >>> df.columns Index(['one', 'two'], dtype='object') >>> list(df.columns) ['one', 'two'] >>> dicts_only = {'a':[1,2,3], 'b':[4,5,6]} 381 Introduction to Data Science 10. >>> dict_df = pd.DataFrame(dicts_only) 11. >>> dict_df a b 0 1 4 1 2 5 2 3 6 >>> dict_df.index RangeIndex(start=0, stop=3, step=1)

You need to import pandas library. Create a dictionary whose values are a series of a one-dimensional arrays. You can create a Pandas DataFrame from a dictionary of series. Pass *dict_series *dictionary as argument to *DataFrame() *class which returns a Pandas DataFrame object. Index labels are passed as a list to index attribute. If the number of labels specified in the various series are not the same, then the resulting index will be the union of all the index labels of various series. In DataFrame *df *under column “*one*“, there is no data element associated with index label “*d*“, so NaN will be inserted at that position. A number of rows and columns in a Pandas DataFrame is obtained using *shape *attribute. Get the index labels for the Pandas DataFrame using *index *attribute. With *columns* attribute, you get all the columns of the DataFrame. Get the columns of DataFrame as a list by passing the *columns *attribute as an argument to the *list() *function. You can create a Pandas DataFrame from a dictionary without using *index *and *columns *attribute. For *dict_df *DataFrame , *a *and *b *are columns and index labels are integers ranging from zero to two .

*Create DataFrame from ndarrays/lists/list of dictionaries*

*Create DataFrame from ndarrays/lists/list of dictionaries*

>>> import numpy as np >>> import pandas as pd >>> dict_ndarrays = {'one': np.random.random(5), 'two':np.random.random(5)} >>> pd.DataFrame(dict_ndarrays) one two 0 0.346580 0.827881 1 0.738850 0.577504 2 0.969715 0.781170 3 0.668432 0.746535 4 0.709333 0.440675 >>> pd.DataFrame([[1,2,3,4,5], [6,7,8,9,10]]) 0 1 2 3 4 0 1 2 3 4 5 1 6 7 8 9 10 columns rows >>> dict_lists = {'one': [1, 2, 3, 4, 5], 'two': [5, 4, 3, 2, 1]} >>> pd.DataFrame(dict_lists) 382 Introduction to Python Programming one two 0 1 5 1 2 4 2 3 3 3 4 2 4 5 1 >>> pd.DataFrame(dict_lists, index=['a', 'b', 'c', 'd', 'e']) one two a 1 5 b 2 4 c 3 3 d 4 2 e 5 1 >>> lists_dicts = [{'a':1, 'b':2}, {'a':5, 'b':10, 'c':20}] >>> pd.DataFrame(lists_dicts) a b c 0 1 2 NaN 1 5 10 20.0

The pandas library is built on top of NumPy. Here, *dict_ndarrays * is a dictionary of *ndarrays *from which you can create a Pandas DataFrame. Also, nested lists can be used to create a Pandas DataFrame. If no index and columns are specified, then both index and columns will have integer labels. *Keys *are considered as *column *labels when a Pandas DataFrame is created using dictionaries. The Pandas DataFrame columns will be preserved in the same order as specified by dictionary *keys*. In, index labels are specified for a DataFrame created from the dictionary. The Pandas DataFrame can also be created from a list of dictionaries. Since DataFrame columns will be a union of all the keys in the list of dictionaries, elements for missing columns will be NaN.

*Pandas DataFrame Column Selection, Addition and Deletion*

*Pandas DataFrame Column Selection, Addition and Deletion*

>>> import pandas as pd >>> la_liga = {"Ranking":[1,2,3], "Team": ["Barcelona", "Atletico Madrid", "Real Madrid"]} >>> df = pd.DataFrame(la_liga) >>> df Ranking Team 0 1 Barcelona 1 2 Atletico Madrid 2 3 Real Madrid >>> df['Team'] 0 Barcelona 1 Atletico Madrid 2 Real Madrid Name: Team, dtype: object >>> df['Played'] = [34, 36, 38] >>> df['Won'] = [27, 23, 22] >>> df[['Played', 'Won']] Played Won 0 34 27 1 36 23 2 38 22 >>> df['Points'] = df['Won'] * 2 >>> df Ranking Team Played Won Points 0 1 Barcelona 34 27 54 1 2 Atletico Madrid 36 23 46 2 3 Real Madrid 38 22 44 >>> df['Lost'] = [1, 5, 6] >>> df Ranking Team Played Won Points Lost 0 1 Barcelona 34 27 54 1 1 2 Atletico Madrid 36 23 46 5 2 3 Real Madrid 38 22 44 6 >>> df['Drawn'] = df['Played'] - df['Won'] - df['Lost'] >>> df Ranking Team Played Won Points Lost Drawn 0 1 Barcelona 34 27 54 1 6 1 2 Atletico Madrid 36 23 46 5 8 2 3 Real Madrid 38 22 44 6 10 >>> df['Year'] = 2018 >>> df Ranking Team Played Won Points Lost Drawn year 0 1 Barcelona 34 27 54 1 6 2018 1 2 Atletico Madrid 36 23 46 5 8 2018 2 3 Real Madrid 38 22 44 6 10 2018 >>> del df['Year'] >>> df.pop('Drawn') 0 6 1 8 2 10 Name: Drawn, dtype: int64 >>> df.insert(5, 'Goal Difference', [63, 38, 42]) >>> df Ranking Team Played Won Points Goal Difference Lost 0 1 Barcelona 34 27 54 63 1 1 2 Atletico Madrid 36 23 46 38 5 2 3 Real Madrid 38 22 44 42 6 >>> df.rename(columns = {'Team':'Club Team'}) Ranking Club Team Played Won Points Goal Difference Lost 0 1 Barcelona 34 27 54 63 1 1 2 Atletico Madrid 36 23 46 38 5 2 3 Real Madrid 38 22 44 42 6

Create Pandas DataFrame df from la_liga dictionary. You can select a particular column in a DataFrame by specifying the column name within quotes inside a bracket of a Pandas DataFrame. You can add a new column to the DataFrame by specifying the column label within the bracket of DataFrame and assign data elements to it. Grab multiple columns from a Pandas DataFrame by passing a list of columns. You can also create a new column by making use of the data elements found in existing columns. Column “Points” is inserted to the Pandas DataFrame df after multiplying all the data elements in column “Won” by 2. You can perform basic arithmetic operations on Pandas DataFrame columns . When inserting a scalar value, it will congenitally be propagated to fill the column. Columns can be deleted or popped . By default, columns get inserted at the end. The insert function is available to insert at a dainty location in the columns . You can rename the column label using the rename() method. The columns attribute has to be passed to the rename() method and assign it with a dictionary where the old column label will be key and new column label will be a value of . All the above operations have a direct impact on the Pandas DataFrame.

*Displaying Data in Pandas DataFrame*

*Displaying Data in Pandas DataFrame*

>>> import pandas as pd >>> df = pd.DataFrame({'WorldCup_Winner':["Brazil", "Germany", "Argentina", "Brazil", "Spain"], 'Year':[1962, 1974, 1986, 2002, 2010]}) >>> df.columns Index(['WorldCup_Winner', 'Year'], dtype='object') >>> df.head(2) WorldCup_Winner Year 0 Brazil 1962 1 Germany 1974 >>> df.tail(2) WorldCup_Winner Year 3 Brazil 2002 4 Spain 2010 >>> df['WorldCup_Winner'].unique() array(['Brazil', 'Germany', 'Argentina', 'Spain'], dtype=object) 7. >>> df['WorldCup_Winner'].unique().tolist() ['Brazil', 'Germany', 'Argentina', 'Spain'] >>> df.transpose() 0 1 2 3 4 WorldCup_Winner Brazil Germany Argentina Brazil Spain Year 1962 1974 1986 2002 2010 >>> df.sort_values(by=['Year'], ascending = False) WorldCup_Winner Year 4 Spain 2010 3 Brazil 2002 2 Argentina 1986 1 Germany 1974 0 Brazil 1962 >>> df.sort_index(ascending = False) WorldCup_Winner Year 4 Spain 2010 3 Brazil 2002 2 Argentina 1986 1 Germany 1974 0 Brazil 1962 >>> df['WorldCup_Winner'].value_counts() Brazil 2 Argentina 1 Germany 1 Spain 1 Name: WorldCup_Winner, dtype: int64 >>> df['WorldCup_Winner'].value_counts().index.tolist() ['Brazil', 'Argentina', 'Germany', 'Spain'] 13. >>> df['WorldCup_Winner'].value_counts().values.tolist() [2, 1, 1, 1]

DataFrame head(n) method returns first n rows and tail(n) method returns last n rows. You can find unique data elements in a column by chaining unique() method with a Pandas DataFrame column using dot notation. The unique() method returns a one-dimensional array-like object, which can be converted to a list using tolist() method. The transpose() method flips the DataFrame over its main diagonal by writing rows as columns and vice versa. The syntax for sort_values() method is,

df.sort_values(by, axis=0, ascending=True)

where the by parameter can be a string, list of strings, index label, column label, list of index labels, or list of column labels to sort by. If the value of the axis is 0 then by may contain column labels. If the value of the axis is 1, then by may contain index labels. By default, the value of the axis parameter is 0. The default value of ascending parameter is True, if so then the data elements will be sorted in ascending order. A False value leads to sorting the data elements in descending order . By default, the sort_index() method, performs sorting on row labels in ascending order and returns a copy of the Pandas DataFrame. If the ascending parameter is set to Boolean False, then the sort_index() method performs sorting in descending order. The value_counts() method when chained with a Pandas DataFrame, returns a Series object containing counts of unique values . The resulting items will be in descending order so that the first element is the most frequently-occurring element. The NA values are excluded by default. The index attribute returns the index or row labels of the Series . The values attribute returns a NumPy representation of the Series .

*Using Pandas DataFrame assign() method*

*Using Pandas DataFrame assign() method*

>>> import pandas as pd >>> df_mountain = pd.DataFrame({"Mountain":['Mount Everest', 'K2', 'Kangchenjunga'], "Length":[8848, 8611, 8586]}) >>> df_mountain.assign(Ranking = [1, 2, 3]) Length Mountain Ranking 0 8848 Mount Everest 1 1 8611 K2 2 2 8586 Kangchenjunga 3 >>> df = pd.DataFrame({'A':[2, 4, 6], 'B':[3, 6, 9]}) >>> df.assign(C = lambda x:x['A'] ** 2) A B C 0 2 3 4 1 4 6 16**** 2 6 9 36

DataFrame has an assign() method that allows you to easily create new columns that are assuredly derived from existing columns . The assign() method always returns a copy of the data, leaving the original Pandas DataFrame untouched.

*Pandas DataFrame **Indexing and Selecting Data*

*Pandas DataFrame*

*Indexing and Selecting Data*

The Python and NumPy indexing operators [] and dot operator . provide quick and easy access to select a subset of data elements in a pandas Pandas DataFrame across a wide range of use cases. However, since the type of the data to be penetrated isn’t known in advance, directly using standard operators has some optimization limits. For production code, it’s highly recommended that you take advantage of the optimized pandas data access methods, like .loc[] and .iloc[], which are used to retrieve rows. Note that .loc[ ] and .iloc[ ] methods are followed by square brackets [ ], not parentheses () and are called as indexers. The .loc[] method is primarily label based, but may also be used with a Boolean array. The .loc[] method will raise KeyError when the items are not found. Inputs accepted by

.loc[] method are a single label, e.g. 5 or ‘a’ (note that 5 is interpreted as a label of the index/row; this use is not an integer position along the index), a list or array of labels [‘a’, ‘b’, ‘c’], a slice items with labels ‘a’:’f’ (note that contrary to usual Python slices, both the start and the stop are included, when present in the index!) and Boolean array. The .iloc[] method is primarily an integer position based (from 0 to length-1 of the axis), but may also be used with a Boolean array. The .iloc[] method will raise an IndexError if a requested indexer is out-of-bounds, except in the case of slice indexers, which allow out-ofbounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs for .iloc[] method are an integer, such as 5, a list or array of integers [4, 3, 0], a slice object with ints 1:7, and a Boolean array. For example,

>>> import numpy as np >>> import pandas as pd >>> df = pd.DataFrame(np.random.rand(5,5), index = ['row_1', 'row_2', 'row_3', 'row_4', 'row_5'], columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']) >>> df col_1 col_2 col_3 col_4 col_5 row_1 0.302179 0.067154 0.848890 0.291533 0.710989 row_2 0.668777 0.246157 0.339020 0.232109 0.390328 row_3 0.787487 0.703837 0.542948 0.839311 0.050887 row_4 0.905814 0.026933 0.381502 0.754635 0.399242 row_5 0.244861 0.343171 0.992433 0.058433 0.266207 >>> df.loc['row_1'] col_1 0.302179 col_2 0.067154 col_3 0.848890 col_4 0.291533 col_5 0.710989 Name: row_1, dtype: float64 >>> df.loc['row_2', 'col_3'] 0.339020 >>> df.loc[['row_1', 'row_2'],['col_2', 'col_3']] col_2 col_3 row_1 0.067154 0.84889 row_2 0.246157 0.33902 >>> df.loc[:, ['col_2', 'col_3']] col_2 col_3 row_1 0.067154 0.848890 row_2 0.246157 0.339020 row_3 0.703837 0.542948 row_4 0.026933 0.381502 row_5 0.343171 0.992433 >>> df.iloc[1] col_1 0.668777 col_2 0.246157 col_3 0.339020 col_4 0.232109 col_5 0.390328 Name: row_2, dtype: float64 >>> df.iloc[3:5, 0:2] col_1 col_2 row_4 0.905814 0.026933 row_5 0.244861 0.343171 >>> df.iloc[:3, :] col_1 col_2 col_3 col_4 col_5 row_1 0.302179 0.067154 0.848890 0.291533 0.710989 row_2 0.668777 0.246157 0.339020 0.232109 0.390328 row_3 0.787487 0.703837 0.542948 0.839311 0.050887 >>> df.iloc[:,:] col_1 col_2 col_3 col_4 col_5 row_1 0.302179 0.067154 0.848890 0.291533 0.710989 row_2 0.668777 0.246157 0.339020 0.232109 0.390328 row_3 0.787487 0.703837 0.542948 0.839311 0.050887 row_4 0.905814 0.026933 0.381502 0.754635 0.399242 row_5 0.244861 0.343171 0.992433 0.058433 0.266207 >>> df.iloc[2:, 2:] col_3 col_4 col_5 row_3 0.542948 0.839311 0.050887 row_4 0.381502 0.754635 0.399242 row_5 0.992433 0.058433 0.266207 >>> df.iloc[:,1] row_1 0.067154 row_2 0.246157 row_3 0.703837 row_4 0.026933 row_5 0.343171 Name: col_2, dtype: float64 >>> df[df > 0.2] col_1 col_2 col_3 col_4 col_5 row_1 0.302179 NaN 0.848890 0.291533 0.710989 row_2 0.668777 0.246157 0.339020 0.232109 0.390328 row_3 0.787487 0.703837 0.542948 0.839311 NaN row_4 0.905814 NaN 0.381502 0.754635 0.399242 row_5 0.244861 0.343171 0.992433 NaN 0.266207

For .loc[row_label_indexing, col_label_indexing] and .iloc[row_integer_indexing, col_integer_indexing] methods, a single argument always refers to selecting data elements from row indices in the Pandas DataFrame and not the column indices. When col_label_indexing or col_integer_indexing is absent, it means all the columns for that particular row will be selected. For example, .loc[‘a’] is equivalent to .loc[‘a’,:]. This example applies to iloc as well. With df.loc[indexer] you know automatically that df.loc[] is selecting rows. In difference, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing

details about indexer and df. In, select the first row labeled as row_1 and all the columns of that row. Line selects data elements present in the second row labeled as row_2 and third column labeled as col_3. Selects values present in the first row, row_1 and second row, row_2 along with their corresponding columns. In, you slice via labels and select all the rows under column 2 and 3.

You can grab data based on position instead of labels using .iloc method. The .iloc[] method provides integer-based indexing. The semantics follow Python and NumPy slicing closely. These are zero-based indexing. When slicing, the start bound is carried, while the upper bound is excluded. In, select all the data elements from the second row along the entire columns. Even though we have labeled these rows and columns, still their integer indices range from 0 to n – 1, where n is the length of the data. Slicing returns a subset of data elements present in DataFrame along with their corresponding labels. In _, even though the row index is out of range, still the existing rows will be selected and out-of-range indexes are handled gracefully. All data elements starting from first to the third row along their entire column are selected . The entire Pandas DataFrame is selected

in . Rows from position three onwards and columns from position three onwards are selected . All the rows in the second column are selected . An important feature of pandas is contingent selection using bracket notation, very similar to numpy. Data elements greater than 0.2 in Pandas DataFrame are displayed while the lower values are treated as NaN . Note, none of the above operations change the original data elements of DataFrame.

*Group By: split-apply-combine*

*Group By: split-apply-combine*

Here, “group by” refers to a process involving one or more of the following steps:

- criteria the data are split into group on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.

- Out of these, the split step is the most straightforward. In fact, in many postures, we may wish to split the data set into groups and do something with those groups.
- In the apply step, we might wish to do one of the following:

Aggregation: compute an abstract statistic (or statistics) for each group. For example,

**Compute group sums or means.**

**Compute group sizes/counts.**

Transformation: perform some group-specific computations and return a likeindexed object. For example,

Standardize data (zscore) within a group.

Filling NAs within groups with a value derived from each group.

Filtration: discard some groups, conforming to a group-wise computation that evaluates True or False. For example, Discard data that belong to groups with only a few members. sum or mean are filter out data base on group.

- Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it does not fit into either of the above two categories.

For example,

>>> import pandas as pd >>> cars_data = {'Company':['General Motors','Ford', 'Toyota', 'General Motors', 'Ford', 'Toyota'], 'Model': ['Camaro', 'Mustang', 'Prius', 'Malibu', 'Fiesta', 'Camry'], 'Sold':[12285, 35273, 34287, 29325, 27459, 17621]} >>> cars_df = pd.DataFrame(cars_data) >>> cars_df Company Model Sold 0 General Motors Camaro 12285 1 Ford Mustang 35273 2 Toyota Prius 34287 3 General Motors Malibu 29325 4 Ford Fiesta 27459 5 Toyota Camry 17621 5. >>> cars_df.groupby('Company').mean() Company Sold Ford 31366 General Motors 20805 Toyota 25954 >>> cars_df.groupby('Company').std() Company Sold Ford 5525.332388 391 Introduction to Data Science General Motors 12049.099551 Toyota 11784.641615 >>>cars_df.groupby('Company').min() Company Model Sold Ford Fiesta 27459 General Motors Camaro 12285 Toyota Camry 17621 >>> cars_df.groupby('Company').max() Company Model Sold Ford Mustang 35273 General Motors Malibu 29325 Toyota Prius 34287 >>> cars_df.groupby('Company').sum() Company Sold Ford 62732 General Motors 41610 Toyota 51908 >>> cars_df.groupby('Company').describe() Sold Company count mean std min 25% 50% 75% max Ford 2.0 31366.0 5525.332388 27459.0 29412.5 31366.0 33319.5 35273.0 General Motors 2.0 20805.0 12049.099551 12285.0 16545.0 20805.0 25065.0 29325.0 Toyota 2.0 25954.0 11784.641615 17621.0 21787.5 25954.0 30120.5 34287.0 >>> cars_df.groupby('Company').count() Company Model Sold Ford 2 2 General Motors 2 2 Toyota 2 2 >>> cars_df.groupby('Company')['Company'].count() Company Ford 2 General Mot ors 2 Toyota 2 Name: Company, dtype: int64 >>> cars_df.groupby('Company')['Company'].count().tolist() [2, 2, 2] >>> cars_df.groupby('Company')['Company'].count().index.tolist() ['Ford', 'General Motors', 'Toyota'] >>> cars_df.groupby(['Company','Model']).groups {('Ford', 'Fiesta'): Int64Index([4], dtype='int64'), ('Ford', 'Mustang'): Int64Index([1], dtype='int64'), ('General Motors', 'Camaro'): Int64Index([0], dtype='int64'), ('General Motors', 'Malibu'): Int64Index([3], dtype='int64'), ('Toyota', 'Camry'): Int64Index([5], dtype='int64'), ('Toyota', 'Prius'): Int64Index([2], dtype='int64')} >>> grp_by_company = cars_df.groupby('Company') >>> for label, group in grp_by_company: ... print(label) ... print(group) ... Ford Company Model Sold 1 Ford Mustang 35273 4 Ford Fiesta 27459 General Motors Company Model Sold 0 General Motors Camaro 12285 3 General Motors Malibu 29325 Toyota Company Model Sold 2 Toyota Prius 34287 5 Toyota Camry 17621

In the above code, cars_df, is the DataFrame on which the *groupby() *method will be applied. The *groupby() *method allows you to group rows of data together based on a column name and call aggregate functions on them. For instance, let’s group based on “Company” column using the *groupby() *method. This will return a *Pandas DataFrameGroupBy* object upon which you can call aggregate methods. If you need to count only one column then specify the name of the column within brackets as shown in for which you can get values and index labels as a list by chaining *tolist() *method.

**Aggregate Functions:**

Aggregate Functions and Their Description:

**mean()**

Compute mean of groups

**sum()**

Compute sum of group values

**size()**

Compute group sizes

**count()**

Compute count of group

**std()**

Standard Deviation of groups

**describe()**

Generate Descriptive Statistics

**min()**

Compute the minimum of group values

**max()**

Compute maximum of group values

The *group’s *attribute is a dictionary whose keys are the computed unique groups and corresponding values are the index labels belonging to each group. Assign the *DataFrameGroupBy *object returned by *groupby() *method to a *grp_by_company *variable. With a *grp_by_company *object, you can iterate through the grouped data by specifying two iterating variables. Here *label *variable returns the data elements of the grouped column company and *group *variable returns the grouped data. *Concatenate, Append and Merge *The pandas library provides various facilities for easily *combining/concatenating *together Series as well as Pandas DataFrame objects. The pandas library also has support for full-featured, high-performance in-memory *merge *operations, also called *join *operations. The pandas library provides a single function, *merge()*, as the entry point for all standard merge operations between different Pandas DataFrame objects.