Exploratory Data Analysis
Some suggested sources:
- Design and Analysis of Experiments by Angela Dean, Daniel Voss, Danel Draguljić
- The Functional Art - An Introduction to Information Graphics and Visualization by Alberto Cairo
- Seaborn Docs
On this page, I’ll try to provide a cookbook for Data Analysis.
I’ll provide some well-known Python snippets in this section. As prerequisites, you should know the Python pandas library and seaborn.
The basic information you need to visualize is:
- The dataset length (and train and validation set if you create different datasets)
- The column names and any naming convention they follow; also keep a list of each column’s unique values, since these will be the features
- 5 or 10 random samples
This is very basic but essential information. If the dataset is labelled, you can also plot the distribution of labels. This is fundamental in classification, for example, since unbalanced data may need different metrics.
```python
import pandas as pd

df = pd.DataFrame(...)  # load your dataset here
df.describe()  # most of the summary information
df['feature'].value_counts(normalize=True)  # distribution of a feature's values
```
Textual Analysis
Textual analysis describes text through numbers computed over the distribution of words and documents. Some examples:
- Average length of texts/documents
- Mean, standard deviation, median, maximum and minimum length
- Number of samples under a certain length
- How many occurrences of a word within a text (or across multiple documents/samples)
- Most common words, least common words
- These counts scale across documents: the occurrences of a word across documents and its occurrences within a single document are two different values, and summing the per-document counts gives the word’s total occurrences across the whole corpus
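The statistics above can be sketched with the standard library alone. This is a minimal example on a hypothetical toy corpus (`docs` is a placeholder for your own list of documents):

```python
from statistics import mean, median, stdev
from collections import Counter

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "the fox jumps over the lazy dog",
]

# Length statistics (in words)
lengths = [len(d.split()) for d in docs]
print("mean:", mean(lengths), "median:", median(lengths), "std:", stdev(lengths))
print("max/min:", max(lengths), min(lengths))

# Number of samples under a certain length
short = [d for d in docs if len(d.split()) < 5]

# Occurrences within each document vs. across the whole corpus
per_doc = [Counter(d.split()) for d in docs]
corpus_counts = sum(per_doc, Counter())
print(corpus_counts.most_common(3))  # most common words
```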
A more niche statistic you can compute is the frequency of word n-grams.
From the sample distribution you can plot a histogram, highlighting the mean, median, maximum and minimum.
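A minimal sketch of such a histogram with matplotlib, using synthetic lognormal document lengths as a stand-in for real data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: lengths of 1000 documents.
rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=5, sigma=0.5, size=1000)

fig, ax = plt.subplots()
ax.hist(lengths, bins=40)
# Highlight mean, median, min and max with vertical dashed lines.
for value, label in [(lengths.mean(), "mean"),
                     (np.median(lengths), "median"),
                     (lengths.min(), "min"),
                     (lengths.max(), "max")]:
    ax.axvline(value, linestyle="--", label=f"{label} = {value:.0f}")
ax.set_xlabel("document length")
ax.set_ylabel("count")
ax.legend()
fig.savefig("length_hist.png")
```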
If you have labeled data, another useful graph is the boxplot, which shows the distribution of quantitative data in a way that facilitates comparison between classes.
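A per-class boxplot is a one-liner in seaborn. This sketch assumes a hypothetical labeled dataset with `label` and `length` columns; swap in your own DataFrame:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical labeled dataset: text length per class.
df = pd.DataFrame({
    "label": ["spam"] * 4 + ["ham"] * 4,
    "length": [120, 95, 150, 110, 40, 55, 60, 35],
})

# One box per class, making the distributions easy to compare.
ax = sns.boxplot(data=df, x="label", y="length")
ax.set_title("Text length by class")
plt.savefig("length_boxplot.png")
```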
The top 10 most frequent 4- and 5-grams are useful for finding noise in the data, such as HTML tags.
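A minimal sketch of this check, counting 4- and 5-grams over a hypothetical corpus where scraped HTML has leaked into the text (the `ngrams` helper and the sample documents are made up for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical corpus where scraped HTML leaked into the text.
docs = [
    "< div class = intro > welcome to the site < / div >",
    "< div class = intro > latest news today < / div >",
    "plain text sample with no markup at all",
]

counts = Counter()
for doc in docs:
    tokens = doc.split()
    for n in (4, 5):
        counts.update(ngrams(tokens, n))

# Frequent 4/5-grams expose boilerplate such as the leaked <div> wrapper.
for gram, c in counts.most_common(10):
    print(c, " ".join(gram))
```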