In our previous Python tutorial, we explained how to use lambda functions in Python. In this tutorial, we will explain how to use the Pandas library in Python.
Pandas and Python together make data science and analytics easier and more effective. Pandas is an open source Python library for handling tabular data.
We will cover the following in this tutorial:
- What is Pandas?
- What is data science or data analytics?
- What Can Pandas Do?
- Pandas installation
- Pandas Series
- Pandas DataFrames
What is Pandas?
Pandas is a Python library created by Wes McKinney in 2008. It was mainly built to help with working on datasets in Python for finance-related work.
Pandas is a widely used open source Python library for data science. It is built around a two-dimensional data structure called a DataFrame, which is similar to an Excel spreadsheet. It provides fast, flexible, easy-to-use data structures and data analysis tools. It is used to analyze, explore, and manipulate datasets, and to clean messy data so that it becomes readable and relevant, which matters because data science depends on relevant data.
What is data science or data analytics?
Data science or data analytics is the process of analyzing large sets of data points to get answers to questions about that data. Pandas is a Python library that makes this work easier and more effective.
What Can Pandas Do?
Pandas is designed to work with datasets. With Pandas, we can compute correlations between two or more columns, as well as average, maximum, and minimum values. We can also clean messy datasets by deleting rows that are not relevant or that contain wrong values.
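The operations mentioned above can be sketched in a few lines. This is a minimal illustration, not a full workflow: the column names `hours` and `score` and their values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value in the "hours" column.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, None],
    "score": [10, 20, 30, 40, 50],
})

# Clean the data by dropping rows that contain missing values.
cleaned = df.dropna()

print(cleaned["hours"].mean())  # average value of a column
print(cleaned["score"].max())   # maximum value of a column
print(cleaned["score"].min())   # minimum value of a column
print(cleaned.corr())           # pairwise correlation between columns
```

Here `dropna()` returns a new DataFrame and leaves `df` unchanged, so the original data is still available if needed.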
Python pandas can be used for different kinds of data, such as:
- Ordered and unordered data.
- Unlabeled data.
- Messy data sets.
- Any type of observational or statistical data sets.
Pandas Installation
Now we will install the Pandas library. If Python and pip are already installed on the system, installing Pandas is easy. You can install it with the following command:
pip install pandas
If the above command fails for any reason, you can use a Python distribution that already includes Pandas, such as Anaconda (which also bundles the Spyder IDE).
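One simple way to check that the installation worked is to import the library and print its version. This is only a quick sanity check; the version number printed will depend on what is installed on your system.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; print the installed version.
print(pd.__version__)
```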
Operations on Pandas Series
A Pandas Series is a one-dimensional labeled array that can hold any type of data. A Series can be created using the following constructor:
pandas.Series(data, index, dtype, copy)
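Now we will import Pandas and create an empty Series:
import pandas as pd

# dtype is passed explicitly because the default dtype of an
# empty Series differs between Pandas versions.
s = pd.Series(dtype='float64')
print(s)
The above code will output the following: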
Series([], dtype: float64)
There are a number of ways to create a Pandas Series. We can use lists, NumPy arrays, and dictionaries, as shown in the following sections.
Create Pandas Series from a Python List
Now we will create a Series by passing a Python list.
import pandas as pd

myList = [10, 20, 30, 40, 50]
s = pd.Series(myList)
print(s)
When we run the above code, it will output a Series like the one below:
0    10
1    20
2    30
3    40
4    50
dtype: int64
The output has two columns. Because a Series supports labels, the first column shows the label of each element (by default, its position) and the second column shows the data from the list.
We can add our own labels by passing a list of labels along with the data list:
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
myList = [10, 20, 30, 40, 50]
s = pd.Series(myList, index=labels)
print(s)
When we run the above code, it will output a Series with labels and data like the one below:
a    10
b    20
c    30
d    40
e    50
dtype: int64
The main advantage of using labels is that we can reference an element of the Series by its label instead of its numerical index.
Create Pandas Series from a Dictionary
We can also pass in a dictionary to create a pandas Series.
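import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in dict type.
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
s = pd.Series(data)
print(s)
When we run the above code, it will output a Series whose labels are the dictionary keys and whose data are the dictionary values:
a    10
b    20
c    30
d    40
e    50
dtype: int64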
Create Pandas Series from NumPy Arrays
We can pass a NumPy array to create a Pandas Series. Here we import the NumPy module, create an array, and then pass that array to create a Series.
import pandas as pd
import numpy as np

myArray = np.array([10, 20, 30, 40, 50])
s = pd.Series(myArray)
print(s)
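When we run the above code, it will output a Series built from the NumPy array like the one below. The dtype follows the NumPy array: it is int32 here, but it is int64 on most platforms.
0    10
1    20
2    30
3    40
4    50
dtype: int32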
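The integer dtype of a Series built from a NumPy array follows the array's own dtype, which can differ between platforms and NumPy versions. If a consistent result is wanted, one option is to pass the `dtype` parameter of the constructor explicitly, as in this small sketch:

```python
import numpy as np
import pandas as pd

my_array = np.array([10, 20, 30, 40, 50])

# Request a specific dtype instead of relying on the platform default.
s = pd.Series(my_array, dtype="int64")
print(s.dtype)
```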
Accessing Data From Pandas Series
We can access the data in a Series by the index number of an element or by its label.
Accessing Series Data By Using Index
Here we will access series data by index:
import pandas as pd
import numpy as np

myArray = np.array([10, 20, 30, 40, 50])
s = pd.Series(myArray)
print(s[0])
print(s[4])
When we run the above code, it will output data like below:
10
50
Accessing Series Data By Using Label
Here we will access Series data by label.
import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in dict type.
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
s = pd.Series(data)
print(s['a'])
print(s['e'])
When we run the above code, it will output data like below:
10
50
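When a Series has custom labels, it can help to state explicitly whether a lookup is by label or by position. The sketch below uses the `loc` (label-based) and `iloc` (position-based) indexers, which are not used elsewhere in this section; the sample values match the examples above.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']       # look up by label
by_position = s.iloc[2]     # look up by integer position (0-based)

label_slice = s.loc['b':'d'].tolist()   # label slices include both endpoints
position_slice = s.iloc[1:3].tolist()   # position slices exclude the end

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Note the difference in slicing: `loc['b':'d']` returns three elements, while `iloc[1:3]` returns two.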
Pandas DataFrame
A Pandas DataFrame is a two-dimensional data structure, like a 2-dimensional array or a table, in which data is arranged in rows and columns. We can create a DataFrame using the following constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
Now we will import Pandas and create an empty DataFrame:
import pandas as pd

df = pd.DataFrame()
print(df)
The above code will output the following empty DataFrame:
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Python List
We can create a DataFrame by passing a simple data list.
import pandas as pd

dataList = [1, 2, 3, 4, 5]
df = pd.DataFrame(dataList)
print(df)
The above program will output the following DataFrame with a default index and a default column label (0):
   0
0  1
1  2
2  3
3  4
4  5
We can also pass a list of rows together with column names to create a DataFrame:
import pandas as pd

dataList = [['smith', 20, 'India'],
            ['william', 30, 'France'],
            ['steve', 40, 'Britain'],
            ['Andy', 35, 'Canada'],
            ['Gary', 50, 'USA']]
df = pd.DataFrame(dataList, columns=['Name', 'Age', 'Country'])
print(df)
The above program will output the following DataFrame with column labels and values:
      Name  Age  Country
0    smith   20    India
1  william   30   France
2    steve   40  Britain
3     Andy   35   Canada
4     Gary   50      USA
Creating a DataFrame from a Series Dictionary
We can also create a DataFrame by passing a dictionary of Series. Each key becomes a column name and each Series supplies that column's values.
import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in dict type.
data = {'India': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'Japan': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])}
df = pd.DataFrame(data)
print(df)
The above program will output the following DataFrame:
   India  Japan
a      1      1
b      2      2
c      3      3
d      4      4
e      5      5
Accessing Column
We can access a particular column by its column name. Here we select the 'Japan' column from the DataFrame.
import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in dict type.
data = {'India': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'Japan': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])}
df = pd.DataFrame(data)
print(df['Japan'])
The above program will output the selected column, which is returned as a Series:
a    1
b    2
c    3
d    4
e    5
Name: Japan, dtype: int64
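Selecting with a single column name returns a Series, while selecting with a list of column names returns a DataFrame. The following sketch shows the difference using the same 'India' and 'Japan' columns; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'India': [1, 2, 3, 4, 5], 'Japan': [1, 2, 3, 4, 5]},
    index=['a', 'b', 'c', 'd', 'e'],
)

one_column = df['Japan']             # single name: a Series
as_frame = df[['Japan']]             # list with one name: a DataFrame
two_columns = df[['Japan', 'India']] # list of names, in the requested order

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(two_columns.columns.tolist())
```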
Adding New column
We can add a new column to a DataFrame by assigning a Series to a new column name.
import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in dict type.
data = {'India': pd.Series([7, 9, 13, 15, 35], index=['a', 'b', 'c', 'd', 'e']),
        'Japan': pd.Series([5, 10, 15, 20, 25], index=['a', 'b', 'c', 'd', 'e'])}
df = pd.DataFrame(data)

# Adding a column. Values are aligned on index labels, so the
# entries labeled 'f' and 'g' (not present in df) are dropped.
df['France'] = pd.Series([10, 20, 30, 40, 50, 60, 70],
                         index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(df)
The above program will output the following DataFrame after adding the new column:
   India  Japan  France
a      7      5      10
b      9     10      20
c     13     15      30
d     15     20      40
e     35     25      50
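When a Series is assigned as a new column, Pandas matches values to rows by index label rather than by position. The small sketch below, with made-up values, shows both sides of that behavior: a label missing from the new Series produces a missing value (NaN), and a label that does not exist in the DataFrame is ignored.

```python
import pandas as pd

df = pd.DataFrame({'India': [7, 9, 13]}, index=['a', 'b', 'c'])

# 'c' is missing from the new Series, and 'z' is not a row of df.
df['France'] = pd.Series([10, 20, 99], index=['a', 'b', 'z'])

print(df)
```

Because of the missing value, the new column is stored as floating point rather than integer.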
Delete Column
We can delete a column from a DataFrame using the del statement or the pop() method.
Here we delete columns using both del and pop():
import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in dict type.
data = {'India': pd.Series([7, 9, 13, 15, 35], index=['a', 'b', 'c', 'd', 'e']),
        'Japan': pd.Series([5, 10, 15, 20, 25], index=['a', 'b', 'c', 'd', 'e']),
        'France': pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])}
df = pd.DataFrame(data)

# Delete a column using the del statement
del df['France']

# Delete a column using the pop() method (it also returns the removed column)
df.pop('Japan')
print(df)
The above program will output the following DataFrame after deleting two columns:
   India
a      7
b      9
c     13
d     15
e     35
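Both del and pop() modify the DataFrame in place. As a complementary sketch, the example below shows that pop() hands back the removed column, and uses `drop(columns=...)`, a method not covered above, which returns a new DataFrame and leaves the original unchanged. The data is a shortened version of the example above.

```python
import pandas as pd

df = pd.DataFrame(
    {'India': [7, 9], 'Japan': [5, 10], 'France': [10, 20]},
    index=['a', 'b'],
)

# pop() removes the column from df and returns it as a Series.
japan = df.pop('Japan')

# drop() returns a new DataFrame; df itself still has 'France'.
smaller = df.drop(columns=['France'])

print(japan.tolist())
print(df.columns.tolist())
print(smaller.columns.tolist())
```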
Indexing a DataFrame
We can do integer-based (positional) indexing on a DataFrame using the iloc indexer, which is used with square brackets rather than called like a function.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 5), columns=['A', 'B', 'C', 'D', 'E'])

# Select the first five rows by position
print(df.iloc[:5])
The above program will output something like the following. The values are random, so they will differ on each run:
          A         B         C         D         E
0 -0.469348 -0.596175  0.086608  0.651538 -1.191260
1 -0.664254 -0.901478  0.623666 -0.205776 -0.034960
2  1.349643 -1.349104 -0.757116  0.387509  1.166415
3 -2.437482 -0.006055 -0.682298 -0.039461  0.069462
4 -0.038990  0.048944  2.251811  0.353188 -1.451316
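Positional indexing can select columns as well as rows by giving two positions separated by a comma. The sketch below uses a seeded NumPy random generator (an assumption made here so the data is reproducible) and shows a row slice, a row-and-column block, and a single cell.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same data is produced on every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 5)), columns=['A', 'B', 'C', 'D', 'E'])

first_rows = df.iloc[:2]     # first two rows, all columns
block = df.iloc[1:3, 0:2]    # rows 1-2 and columns 0-1 (end positions excluded)
cell = df.iloc[0, 4]         # single value: first row, fifth column

print(first_rows.shape)
print(block.columns.tolist())
print(cell == df.loc[0, 'E'])
```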
Conclusion
In this tutorial, we have covered Python Pandas and how to work with Pandas Series and DataFrames. We will try to cover more Pandas functions in other tutorials. If you have any questions or comments, you can post them in the comments section and we will get back to you.