Data Mining

Generalize and consolidate data in multidimensional space.

Data Warehouse

Transactional Database

Also called roll-up operations; performs data aggregations that can be computed across many dimensions.

Consolidation

Drill down

Slicing and dicing

The contrasting technique to consolidation; it lets users navigate in the reverse direction, from summarized data toward more detailed data.

Drill Down

Consolidation
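As a rough illustration of consolidation (roll-up) versus drill-down, the sketch below aggregates a tiny, made-up set of sales rows at two levels of detail. The SALES rows and the aggregate helper are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, city, sales)
SALES = [
    ("Jan", "Chicago", 100),
    ("Jan", "Boston", 50),
    ("Feb", "Chicago", 70),
    ("Feb", "Boston", 30),
]


def aggregate(rows, key_indexes):
    """Sum the sales measure, grouped by the dimensions at key_indexes."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key] += row[-1]
    return dict(totals)


# Consolidation (roll-up): aggregate away the city dimension, keep month only.
by_month = aggregate(SALES, [0])

# Drill-down: return to the finer (month, city) level of detail.
by_month_city = aggregate(SALES, [0, 1])
```

Rolling up removes a dimension from the grouping key, while drilling down adds one back.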

Integrated, web-based analytical processing system owned by IBM.

IBM Cognos Analysis Studio

IBM COGNOS

IBM COGNOS REPORT STUDIO

Is used to search for background information about an action/event and prepare the

IBM Cognos Report Studio

IBM Cognos Analysis Studio

IBM COGNOS

Is a Washington-based company that provides BI and mobile software services worldwide. MicroStrategy Analytics enables companies and organizations to analyze large volumes of data and securely distribute business-specific insight throughout the organization.

IBM Cognos Analysis Studio

IBM COGNOS

Micro Strategy

Is a MOLAP (multidimensional online analytical processing) server typically used as a BI tool for purposes such as controlling and budgeting. It is a product of Jedox AG. It uses spreadsheet software as its user interface and lets different users share a centralized database that acts as a single source of truth.

Micro Strategy

Palo OLAP Server

IBM Cognos Analysis Studio

Is an open-source multidimensional analytics engine. It is designed to provide a SQL interface and MOLAP on top of Hadoop to support large data sets. It was developed to reduce query processing time over billions of rows of data.

Apache Kylin

Palo OLAP Server

Micro Strategy

Switzerland-based company _____ owns business intelligence software of the same name. It sells an online analytical processing server implemented in Java according to J2EE standards. It is an in-memory OLAP server and is compatible with any data source that holds its data in tabular form.

IcCube

Apache Kylin

Micro Strategy

Is a powerful open-source tool that provides key BI features such as OLAP services, data integration, data mining, extract-transform-load (ETL), reporting, and dashboard capabilities. ____ is built on the Java platform and works with Windows, Linux, and Mac operating systems.

Pentaho BI

Micro Strategy

Apache Kylin

Is a highly interactive tool whose strengths include its ability to work with categorical data, large data sets, and geographical data. It is a general-purpose data visualization tool that consists of interlinked plots and queries.

Mondrian

Micro Strategy

Apache Kylin

A unique platform, _____ enables its customers to gain deeper insight into their data and helps them make faster, better-informed decisions. It offers visual analytics via highly interactive dashboards and can provide metadata search, timely alerts, and powerful operational reporting.

JsHypercube

OBIEE (Oracle Business Intelligence Enterprise Edition)

Is an OLAP database written in the JavaScript programming language. It is a lightweight database.

OBIEE (Oracle Business Intelligence Enterprise Edition)

JsHypercube

Apache Kylin

The _____ algorithm aims to reconstruct causality from a set of sequences of events.

Teiresias

Alpha

The _____ algorithm enables the discovery of rigid patterns (motifs) in biological sequences.

Alpha

Teiresias

The α-algorithm is an algorithm used in data mining that aims to reconstruct causality from a set of sequences of events.

Alpha

Teiresias

This is designed to operate on databases containing transactions, for example collections of items bought by customers or details of website visits. It uses a bottom-up approach in which frequent subsets are extended one item at a time (a step called candidate generation) and groups of candidates are tested against the data.

Alpha

Teiresias

Apriori Algorithm
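The bottom-up candidate generation described above can be sketched in a few lines. This is a minimal illustration, not an optimized implementation; the apriori function name, the baskets data, and the min_support threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Test each candidate against the data by counting containing transactions.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=2)
```

With these baskets, the pair butter and milk appears only once, so it is dropped, and the three-item candidate is pruned without ever being counted.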

This is used for sequence mining. The approach follows the Apriori (level-wise) paradigm; one way of applying the level-wise paradigm is to first discover the frequent items.

Alpha

Teiresias

GSP (Generalized Sequential Pattern) Algorithm

A combinatorial algorithm for the discovery of rigid patterns (motifs) in biological sequences. It uses regular expressions to define the patterns, which allows the reported patterns to consist not only of the characters that appear in each position (literals)

Teiresias

Alpha

GSP (Generalized Sequential Pattern) Algorithm

b follows a, but a never follows b, so b depends on a. Written as a -> b.

Temporal independence

Temporal dependency

Alpha algorithm

Teiresias

There is a trace where a follows b and also a trace where b follows a, so a and b can be executed in parallel. Written as a || b.

Independence

Temporal dependency

Temporal independence

There is no trace where a follows b or b follows a, so a and b are independent. Written as a # b.

Temporal independence

Temporal dependency

Independence
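The three relations above (a -> b, a || b, a # b) can be derived mechanically from an event log. The sketch below assumes that "follows" means "directly follows within a trace"; the footprint function, the "<-" marker for the reversed dependency, and the sample log are all hypothetical.

```python
def footprint(traces):
    """Classify each ordered pair of activities from directly-follows pairs."""
    follows = set()      # (a, b) means b directly follows a in some trace
    activities = set()
    for trace in traces:
        activities.update(trace)
        follows.update(zip(trace, trace[1:]))

    relations = {}
    for a in activities:
        for b in activities:
            ab, ba = (a, b) in follows, (b, a) in follows
            if ab and not ba:
                relations[(a, b)] = "->"   # temporal dependency: b depends on a
            elif ba and not ab:
                relations[(a, b)] = "<-"   # reversed dependency (assumed marker)
            elif ab and ba:
                relations[(a, b)] = "||"   # temporal independence (parallel)
            else:
                relations[(a, b)] = "#"    # independence
    return relations


# Hypothetical event log with two traces.
log = [("a", "b", "c", "d"), ("a", "c", "b", "d")]
rel = footprint(log)
```

In this log, b and c occur in both orders, so they are classified as parallel, while a and d never directly follow each other and are classified as independent.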

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the _____ method.

DataFrame.agg()

groupby()
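A small sketch of per-column aggregation with DataFrame.agg() follows. The tiny titanic frame here is made-up sample data standing in for the real data set, and the chosen statistics are only an example.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data set.
titanic = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# A different combination of statistics for each column.
summary = titanic.agg({
    "Age": ["min", "max", "median"],
    "Fare": ["min", "max", "mean"],
})
```

Statistics requested for only one of the columns show up as NaN in the other column of the result.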

The _____ method is applied on the Sex column to make a group per category. The average age for each gender is then calculated and returned.

DataFrame.agg()

groupby()

Split-apply-combine pattern:

Split the data into groups. Apply a function to each group independently. Combine the results into a data structure.

>>titanic.groupby("Sex")["Age"].mean()

Average age of the Titanic passengers?

>>titanic["Age"].mean()

>>titanic[["Age", "Fare"]].median()

Median age and ticket fare price of the Titanic passengers?

>>titanic["Age"].mean()

>>titanic[["Age", "Fare"]].median()

The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function

>>titanic[["Age", "Fare"]].describe()

>>titanic["Age"].mean()

The average age for male versus female Titanic passengers

>>titanic[["Sex", "Age"]].groupby("Sex").mean()

>>titanic[["Age", "Fare"]].describe()

The apply and combine steps are typically done together in pandas. In the previous example, we explicitly selected the 2 columns first. If we do not, the mean method is applied to each column containing numerical values:

>>titanic.groupby("Sex").mean(numeric_only=True)

>>titanic.groupby("Sex")["Age"].mean()

It does not make much sense to get the average value of Pclass. If we are only interested in the average age for each gender, column selection (square brackets [] as usual) is supported on the grouped data as well, so we do the following:

>>titanic.groupby("Sex")["Age"].mean()

>>titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

What is the mean ticket fare price for each of the sex and cabin class combinations?

>>titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

>>titanic.groupby("Sex")["Age"].mean()

What is the number of passengers in each of the cabin classes?

>>titanic["Pclass"].value_counts()

>>titanic.groupby("Pclass")["Pclass"].count()

The _____ method is a shortcut, as it is actually a groupby operation combined with counting the number of records within each group:

>>titanic["Pclass"].value_counts()

>>titanic.groupby("Pclass")["Pclass"].count()

Both ____ and ___ can be used in combination with groupby. Whereas size includes NaN values and just provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use the dropna argument to include or exclude the NaN values.

Size and quantity

Size and count

Number and size
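The difference between the two groupby counters, and the effect of dropna in value_counts, can be checked on a few rows. The small titanic frame below is made-up sample data with two missing ages.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data set, with two missing ages.
titanic = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, None, 22.0, 26.0, None],
})

grouped = titanic.groupby("Pclass")["Age"]
rows_per_class = grouped.size()    # number of rows per group, NaN included
ages_per_class = grouped.count()   # number of non-missing ages per group

# value_counts drops NaN by default; dropna=False gives NaN its own entry.
counts_without_nan = titanic["Age"].value_counts()
counts_with_nan = titanic["Age"].value_counts(dropna=False)
```

Here the two results differ for both classes, because each class contains one passenger with a missing age.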

Facilitate the online analytical processing of multidimensional data.

Database

Data cubes

Data Warehouse

The _____ cuboid is the least generalized of all the cuboids in the data cube.

Upper

Base

Lower

The most generalized cuboid is the ___, commonly represented as all.

Latest cuboid

Apex cuboid

Top cuboid

An ______ aggregates over one or more dimensions, where each aggregated dimension is indicated by a ∗ in the cell notation.

Aggregate cell

Base cell

We say that a is ________ (i.e., from an m-dimensional cuboid) if exactly m (m ≤ n) values among {a_1, a_2, ..., a_n} are not ∗. If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e., where m < n).

N-dimensional cell

M-dimensional cell

Is the task of grouping objects so that objects in the same group are more similar, in some sense, to each other than to objects in other groups. It is used in fields such as pattern recognition, image analysis, bioinformatics, information retrieval, and computer graphics.

Cluster analysis

Database Analysis

Classifying

____ is the process of partitioning a set of data objects (or observations) into subsets, where each subset is a ____.

Clustering, Cluster

Grouping, group

Defining, Definition

Hierarchical clustering builds models based on distance connectivity.

Density models:

Connectivity models:

Centroid model:

Distribution model:

For example, the k-means algorithm represents each cluster by a single mean vector.

Centroid model:

Density models:

Distribution model:

Connectivity models:
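To make "each cluster is represented by a single mean vector" concrete, here is a minimal sketch of the usual k-means iteration (Lloyd's algorithm) on 2-D points. The kmeans function, the fixed starting centroids, and the sample points are assumptions chosen to keep the example deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean vector of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical sample points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Real implementations usually pick the starting centroids randomly and stop when the assignments no longer change; a fixed iteration count is used here only for brevity.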

Clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.

Centroid model:

Distribution model:

Density models:

In biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.

Subspace models:

Group models:

Some algorithms do not provide a refined model of their results and only provide the grouping information.

Group models:

Graph-based models:

A clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge.

Graph-based models:

Group models:

Also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster.

Fuzzy Clustering:

K-Means Clustering:
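One common way to express "a data point can belong to more than one cluster" is a membership weight per cluster that sums to 1. The sketch below uses the membership formula from fuzzy c-means as an assumed example; the memberships function, the fuzzifier m, and the sample values are hypothetical.

```python
def memberships(point, centers, m=2.0):
    """Fuzzy c-means style membership of one 1-D point in each cluster."""
    distances = [abs(point - c) for c in centers]
    if 0.0 in distances:
        # The point sits exactly on a center: full membership there.
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    # Closer centers receive larger weights; the weights sum to 1.
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# A point at 2.0 lies between hypothetical centers at 0.0 and 6.0.
weights = memberships(2.0, centers=[0.0, 6.0])
```

In hard clustering such as k-means, the same point would simply be assigned to the nearer center.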

A method of vector quantization. This algorithm has a loose relationship to the k-nearest neighbor classifier.

Fuzzy Clustering:

K-Means Clustering:

A variation of k-means. This cluster analysis algorithm has the effect of minimizing error over all clusters with respect to the 1-norm distance metric.

K-Means Clustering:

Fuzzy Clustering:

K-Median Clustering:
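The link between medians and the 1-norm can be checked numerically: within a single cluster, the median gives a lower sum of absolute deviations than the mean. The l1_cost helper and the sample cluster (with one outlier) are made up for this illustration.

```python
from statistics import mean, median


def l1_cost(values, center):
    """Sum of absolute (1-norm) distances from each value to the center."""
    return sum(abs(v - center) for v in values)


# Hypothetical 1-D cluster containing one outlier.
cluster = [1.0, 2.0, 3.0, 100.0]

cost_at_median = l1_cost(cluster, median(cluster))
cost_at_mean = l1_cost(cluster, mean(cluster))
```

The outlier pulls the mean far from most of the points, which is one reason a median-based center can be more robust.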

A method of hierarchical clustering based on grouping clusters in bottom-up fashion (agglomerative clustering). Each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.

Single Linkage Clustering:

K-Median Clustering:

K-Means Clustering:
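The closest-pair rule can be sketched as a short agglomerative loop over 1-D points. This is an illustrative, unoptimized version; the single_linkage function, its num_clusters stopping rule, and the sample points are assumptions.

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering of 1-D points using closest-pair distance."""
    clusters = [[p] for p in points]  # every point starts in its own cluster

    def distance(c1, c2):
        # Single linkage: distance between the closest pair across two clusters.
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        # Find the two clusters with the smallest single-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (i < j, so index i stays valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical sample points.
groups = single_linkage([1.0, 1.5, 5.0, 5.2, 9.0], num_clusters=3)
```

Each pass merges the two nearest clusters, so running the loop down to one cluster would trace out the full bottom-up hierarchy.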

A group of methods that are effective in high-dimensional data applications.

K-Median Clustering:

K-Means Clustering:

Spectral Clustering:

Fuzzy Clustering:

A method of cluster analysis which seeks to build a hierarchy of clusters. There are 2 types (Bolton, 2017):

Hierarchical Clustering:

Spectral Clustering:

K-Median Clustering:

K-Means Clustering:

Bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Agglomerative:

Divisive:

The opposite of the agglomerative approach. In this top-down approach, all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Agglomerative:

Divisive:

Uses statistical techniques from data mining, data modelling, and machine learning to analyze the current flow of data and provide information about something unknown. With the available data and these techniques, a business owner can estimate what is likely to happen.

Predictive Analytics:

Decision Support System:

An application that uses large volumes of data to support determinations, judgments, and courses of action by providing information for decision making. The information can be used by the computer itself or by whoever in the organization has the authority to act on it.

Decision Support System:

Web Mining:

Refers to the application of data mining techniques to discover patterns, structures, and knowledge from the Web. According to analysis targets, web mining can be organized into three main areas: web content mining, web structure mining, and web usage mining.

Web Mining:

Decision Support System:

Let a = (a_1, a_2, ..., a_n, measures)

N dimensional data cube

M dimensional data cube

(Jan, ∗, ∗, 2800) and (∗, Chicago, ∗, 1200) are 1-D cells; (Jan, ∗, Business, 150) is a 2-D cell; and (Jan, Chicago, Business, 45) is a 3-D cell. Here, all base cells are 3-D, whereas 1-D and 2-D cells are aggregate cells.

Base and aggregate cells. Consider a data cube with the dimensions month, city, and customer group, and the measure sales.
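The m-dimensional cell definition can be applied directly to the example cells by counting the dimension values that are not ∗. In this sketch the ASCII "*" stands in for ∗, and the cell_dimension and is_base_cell helpers are hypothetical names.

```python
def cell_dimension(cell, num_dimensions):
    """Return m, the number of dimension values in the cell that are not '*'."""
    return sum(1 for value in cell[:num_dimensions] if value != "*")


def is_base_cell(cell, num_dimensions):
    """A base cell has m = n; any cell with m < n is an aggregate cell."""
    return cell_dimension(cell, num_dimensions) == num_dimensions


# Cells over the dimensions (month, city, customer group) with the measure sales.
cells = [
    ("Jan", "*", "*", 2800),
    ("*", "Chicago", "*", 1200),
    ("Jan", "*", "Business", 150),
    ("Jan", "Chicago", "Business", 45),
]
dims = [cell_dimension(c, 3) for c in cells]
base_flags = [is_base_cell(c, 3) for c in cells]
```

Only the last cell has a value in all three dimensions, so it is the only base cell; the other three are aggregate cells.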
