Exploratory Data Analysis
Some suggested sources:
- Design and Analysis of Experiments by Angela Dean, Daniel Voss, Danel Draguljić
- The Functional Art - An Introduction to Information Graphics and Visualization by Alberto Cairo
- Seaborn Docs
On this page, I’ll try to provide a cookbook for Data Analysis.
I’ll provide some well-known Python snippets in this section. As prerequisites, you should know the Python pandas library and seaborn.
The basic information you need to visualize is:
- The dataset length (and train and validation set if you create different datasets)
- The column names and any naming convention they follow; also keep a list of each column’s unique values, since these will be the features
- 5 or 10 random samples
This is very basic but essential information. If the dataset is labelled, you can also plot the distribution of labels. This is fundamental in classification, for example, since unbalanced data may need different metrics.
```python
import pandas as pd

df = pd.DataFrame(...)  # load your dataset here
df.describe()  # most of the summary information
df['feature'].value_counts(normalize=True)  # distribution of a feature's values
```
Textual Analysis
Textual analysis describes text through numbers computed over the distribution of words and documents. Some examples:
- Average length of texts/documents
- Mean, standard deviation, median, maximum and minimum length
- Number of samples under a certain length
- How many occurrences of a word within a text (or across multiple documents/samples)
- Most common words, least common words
- These counts scale across documents: the occurrences of a word across documents and its occurrences within a single document are two different values, and summing the per-document counts gives the word’s total occurrences across the whole corpus
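The statistics above can be sketched with the standard library alone. This is a minimal example on a hypothetical toy corpus (`docs` is a placeholder for your own list of documents):

```python
from statistics import mean, median, stdev
from collections import Counter

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "the fox jumps over the lazy dog",
]

# Length statistics (in words)
lengths = [len(d.split()) for d in docs]
print("mean:", mean(lengths), "median:", median(lengths), "std:", stdev(lengths))
print("max/min:", max(lengths), min(lengths))

# Number of samples under a certain length
short = [d for d in docs if len(d.split()) < 5]

# Occurrences within each document vs. across the whole corpus
per_doc = [Counter(d.split()) for d in docs]
corpus_counts = sum(per_doc, Counter())
print(corpus_counts.most_common(3))  # most common words
```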
A more niche statistic you can compute is the frequency of word n-grams.
From the sample distribution you can plot a histogram, highlighting the mean, median, maximum and minimum.
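A minimal sketch of such a histogram with matplotlib, using synthetic lognormal document lengths as a stand-in for real data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: lengths of 1000 documents.
rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=5, sigma=0.5, size=1000)

fig, ax = plt.subplots()
ax.hist(lengths, bins=40)
# Highlight mean, median, min and max with vertical dashed lines.
for value, label in [(lengths.mean(), "mean"),
                     (np.median(lengths), "median"),
                     (lengths.min(), "min"),
                     (lengths.max(), "max")]:
    ax.axvline(value, linestyle="--", label=f"{label} = {value:.0f}")
ax.set_xlabel("document length")
ax.set_ylabel("count")
ax.legend()
fig.savefig("length_hist.png")
```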
If you have labeled data, another useful graph is the boxplot, which shows the distribution of quantitative data in a way that facilitates comparison between classes.
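A per-class boxplot is a one-liner in seaborn. This sketch assumes a hypothetical labeled dataset with `label` and `length` columns; swap in your own DataFrame:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical labeled dataset: text length per class.
df = pd.DataFrame({
    "label": ["spam"] * 4 + ["ham"] * 4,
    "length": [120, 95, 150, 110, 40, 55, 60, 35],
})

# One box per class, making the distributions easy to compare.
ax = sns.boxplot(data=df, x="label", y="length")
ax.set_title("Text length by class")
plt.savefig("length_boxplot.png")
```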
The top 10 most frequent 4- and 5-grams are useful for finding noise in the data, such as HTML tags.
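A minimal sketch of this check, counting 4- and 5-grams over a hypothetical corpus where scraped HTML has leaked into the text (the `ngrams` helper and the sample documents are made up for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical corpus where scraped HTML leaked into the text.
docs = [
    "< div class = intro > welcome to the site < / div >",
    "< div class = intro > latest news today < / div >",
    "plain text sample with no markup at all",
]

counts = Counter()
for doc in docs:
    tokens = doc.split()
    for n in (4, 5):
        counts.update(ngrams(tokens, n))

# Frequent 4/5-grams expose boilerplate such as the leaked <div> wrapper.
for gram, c in counts.most_common(10):
    print(c, " ".join(gram))
```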