Processing Unstructured Data, Linear Regression and Chi-Square Test in Python

Last updated on Dec 13 2021
Sankalp Agarwal


Data that is already in a row and column format, or that can easily be converted to rows and columns so that it fits nicely into a database, is known as structured data. Examples are CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, and missing values are represented as blanks between the delimiters. But sometimes we get data where the lines are not fixed width, or the data is just HTML, image or PDF files. Such data is known as unstructured data. While an HTML file can be handled by processing its HTML tags, a feed from Twitter or a plain text document from a news feed has neither delimiters nor tags to work with. In such scenarios we use built-in functions from various Python libraries to process the file.

Reading Data

In the example below we take a text file and read it, separating out each of its lines. The output can then be divided further into lines and words. The input file is a text document containing some paragraphs describing the Python language.

filename = r'path\input.txt'
with open(filename) as fn:
    # Read the first line
    ln = fn.readline()
    # Keep count of lines
    lncnt = 1
    # readline() returns an empty string at end of file, which ends the loop
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        ln = fn.readline()
        lncnt += 1

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python features a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.

Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and features a large and comprehensive standard library.

Line 3: Python interpreters are available for several operating systems. CPython, the reference implementation of Python, is open source software and features a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
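
The listing above stops at whole lines. As a minimal sketch of dividing each line further into words, assuming whitespace-separated words and using a short inline sample string (hypothetical, standing in for the contents of the input file):

```python
# Hypothetical sample text standing in for the contents of the input file
sample_text = (
    "Python is an interpreted language.\n"
    "\n"
    "It supports multiple programming paradigms.\n"
)


def split_into_words(text):
    """Return one list of words for each non-empty line of text."""
    return [line.split() for line in text.splitlines() if line.strip()]


words_per_line = split_into_words(sample_text)
for number, words in enumerate(words_per_line, start=1):
    print("Line {}: {} words -> {}".format(number, len(words), words))
```

Note that split() keeps punctuation attached to the words (for example "language."), which may or may not be what you want for later analysis.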

Counting Word Frequency

We can count the frequency of the words in the file using the Counter class from the collections module as follows.

from collections import Counter

with open(r'pathinput2.txt') as f:
    # Split the file contents on whitespace and count each word
    p = Counter(f.read().split())
    print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '199
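
Splitting on whitespace alone counts "Python" and "python," as different words. The sketch below shows one possible way to normalize the text before counting, by lowercasing it and keeping only runs of letters, digits and apostrophes. The sample string and the count_words helper are hypothetical and used only for illustration:

```python
import re
from collections import Counter

# Hypothetical sample text, used here instead of reading a file
sample_text = "Python is simple. Python is readable, and python is popular."


def count_words(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


counts = count_words(sample_text)
# most_common(n) returns the n most frequent words with their counts
print(counts.most_common(2))
```

Whether to fold case or drop punctuation depends on the analysis; for a quick frequency table this kind of normalization usually gives more meaningful counts.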

 Python – Chi-Square Test

The Chi-Square test is a statistical method to determine whether two categorical variables have a significant association between them. Both variables should come from the same population and should be categorical, such as Yes/No, Male/Female or Red/Green. For example, we can build a data set with observations on people's ice-cream buying patterns and try to relate the gender of a person to the flavour of ice-cream they prefer. If an association is found, we can plan an appropriate stock of flavours by knowing the number of people of each gender visiting.

We use functions from the SciPy and NumPy libraries to work with the chi-square test. The example below plots the chi-square distribution for several degrees of freedom.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(1, 1)
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]
# Plot the chi-square density for each degree of freedom with its own line style
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df = {}'.format(df))
plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
# The label argument above gives legend() entries to display
plt.legend()
plt.show()

Its output is as follows:

[Figure: Chi-Square Distribution]
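
The listing above plots the chi-square distribution but does not run a test. As a rough sketch of the test itself on the ice-cream example, the code below uses scipy.stats.chi2_contingency on a small contingency table. The counts are made up for illustration and are not from any real survey:

```python
import numpy as np
from scipy import stats

# Hypothetical counts, made up for illustration.
# Rows: Male, Female. Columns: Vanilla, Chocolate, Strawberry.
observed = np.array([
    [30, 45, 25],
    [35, 20, 45],
])

# chi2_contingency tests the null hypothesis that rows and columns are independent
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("chi-square statistic:", round(chi2, 3))
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under independence:")
print(expected)

# A common (but arbitrary) significance threshold
alpha = 0.05
if p_value < alpha:
    print("Gender and preferred flavour appear to be associated.")
else:
    print("No significant association found.")
```

A small p-value suggests that flavour preference differs between the two groups in this sample; it does not say anything about why.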

Python – Linear Regression

In linear regression, two variables are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.

The function in Seaborn used to find the linear regression relationship is regplot. The example below shows its use.

import seaborn as sb
from matplotlib import pyplot as plt
df = sb.load_dataset('tips')
sb.regplot(x = "total_bill", y = "tip", data = df)
plt.show()

Its output is as follows:

[Figure: regplot of tip against total_bill]
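
regplot draws the fitted line but does not report its coefficients. One way to obtain the slope and intercept is scipy.stats.linregress, sketched below. The bill and tip values are hypothetical and chosen only to keep the example self-contained; they are not taken from the tips data set:

```python
import numpy as np
from scipy import stats

# Hypothetical bill and tip amounts, made up for illustration
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.5, 3.1, 4.4, 6.2, 7.3])

# Fit tip = intercept + slope * total_bill by ordinary least squares
result = stats.linregress(total_bill, tip)

print("slope:", round(result.slope, 3))
print("intercept:", round(result.intercept, 3))
print("r-squared:", round(result.rvalue ** 2, 3))

# Use the fitted line to predict the tip for a bill of 35
predicted_tip = result.intercept + result.slope * 35.0
print("predicted tip for a bill of 35:", round(predicted_tip, 3))
```

The same call should work on the columns of the tips DataFrame, for example by passing df["total_bill"] and df["tip"] as the two arguments.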

So, this brings us to the end of the blog. This Tecklearn ‘Processing Unstructured Data and Linear Regression and Chi’ blog helps you with commonly asked questions if you are looking for a job in Python programming. If you wish to learn Python and build a career in the Data Science domain, then check out our interactive Python with Data Science Training, which comes with 24/7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/python-with-data-science/

Python with Data Science Training

About the Course

Python with Data Science training lets you master the concepts of the widely used and powerful programming language, Python. This Python Course will also help you master important Python programming concepts such as data operations, file operations, object-oriented programming and various Python libraries such as Pandas, NumPy, Matplotlib which are essential for Data Science. You will work on real-world projects in the domain of Python and apply it for various domains of Big Data, Data Science and Machine Learning.

Why Should you take Python with Data Science Training?

  • Python is the preferred language for new technologies such as Data Science and Machine Learning.
  • Average salary of Python Certified Developer is $123,656 per annum – Indeed.com
  • Python is by far the most popular language for data science. Python held 65.6% of the data science market.

What you will Learn in this Course?

Introduction to Python

  • Define Python
  • Understand the need for Programming
  • Know why to choose Python over other languages
  • Setup Python environment
  • Understand various Python concepts – Variables, Data Types, Operators, Conditional Statements and Loops
  • Illustrate String formatting
  • Understand Command Line Parameters and Flow control

Python Environment Setup and Essentials

  • Python installation
  • Windows, Mac & Linux distribution for Anaconda Python
  • Deploying Python IDE
  • Basic Python commands, data types, variables, keywords and more

Python language Basic Constructs

  • Looping in Python
  • Data Structures: List, Tuple, Dictionary, Set
  • First Python program
  • Write a Python Function (with and without parameters)
  • Create a member function and a variable
  • Tuple
  • Dictionary
  • Set and Frozen Set
  • Lambda function

OOP (Object Oriented Programming) in Python

  • Object-Oriented Concepts

Working with Modules, Handling Exceptions and File Handling

  • Standard Libraries
  • Modules Used in Python (OS, Sys, Date and Time etc.)
  • The Import statements
  • Module search path
  • Package installation ways
  • Errors and Exception Handling
  • Handling multiple exceptions

Introduction to NumPy

  • Introduction to arrays and matrices
  • Indexing of array, datatypes, broadcasting of array math
  • Standard deviation, Conditional probability
  • Correlation and covariance
  • NumPy Exercise Solution

Introduction to Pandas

  • Pandas for data analysis and machine learning
  • Pandas for data analysis and machine learning Continued
  • Time series analysis
  • Linear regression
  • Logistic Regression
  • ROC Curve
  • Neural Network Implementation
  • K Means Clustering Method

Data Visualisation

  • Matplotlib library
  • Grids, axes, plots
  • Markers, colours, fonts and styling
  • Types of plots – bar graphs, pie charts, histograms
  • Contour plots

Data Manipulation

  • Perform function manipulations on Data objects
  • Perform Concatenation, Merging and Joining on DataFrames
  • Iterate through DataFrames
  • Explore Datasets and extract insights from them

Scikit-Learn for Natural Language Processing

  • What is natural language processing, working with NLP on text data
  • Scikit-Learn for Natural Language Processing
  • The Scikit-Learn machine learning algorithms
  • Sentiment Analysis – Twitter

Introduction to Python for Hadoop

  • Deploying Python coding for MapReduce jobs on Hadoop framework.
  • Python for Apache Spark coding
  • Deploying Spark code with Python
  • Machine learning library of Spark MLlib
  • Deploying Spark MLlib for Classification, Clustering and Regression

Got a question for us? Please mention it in the comments section and we will get back to you.

 
