Data Mining
Generalize and consolidate data in multidimensional space.
Data Warehouse
Transactional Database
OLAP?
Online Analytical Processing
Onion Lapis Aso Pusa
Also called the roll-up operation; performs data aggregations that can be computed across many dimensions.
Consolidation
Drill down
Slicing and dicing
The contrasting technique to consolidation; allows users to navigate through the data in the reverse direction, from summaries down to details.
Drill Down
Consolidation
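As a rough illustration of consolidation (roll-up) versus drill-down, the sketch below aggregates a tiny made-up sales table at two levels of detail. The records, the dimension names (`year`, `quarter`, `city`) and the helper `aggregate` are assumptions made for this example and do not come from any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (year, quarter, city, sales)
SALES = [
    (2023, "Q1", "Chicago", 100),
    (2023, "Q1", "Boston", 50),
    (2023, "Q2", "Chicago", 70),
    (2024, "Q1", "Chicago", 30),
]
DIMENSIONS = ("year", "quarter", "city")


def aggregate(records, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[i] for i in idx)
        totals[key] += rec[-1]  # last field is the sales measure
    return dict(totals)


# Roll-up (consolidation): summarize to the coarser "year" level.
by_year = aggregate(SALES, keep=("year",))

# Drill-down: go the other way, from year to year + quarter detail.
by_year_quarter = aggregate(SALES, keep=("year", "quarter"))

print(by_year)
print(by_year_quarter)
```

Keeping fewer dimensions in `keep` corresponds to rolling up; adding dimensions back corresponds to drilling down.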
A complete toolkit for building data pipelines
IcCube
Xplenty
IBM Cognos
Integrated, web-based analytical processing system owned by IBM.
IBM Cognos Analysis Studio
IBM Cognos
IBM Cognos Report Studio
Is used to search for background information about an action/event and prepare the
IBM Cognos Report Studio
IBM Cognos Analysis Studio
IBM Cognos
Is a Washington-based company that provides services on BI and mobile software worldwide. MicroStrategy Analytics enables companies/organizations to analyze large volumes of data and distribute the business specific insight throughout the organization securely.
IBM Cognos Analysis Studio
IBM Cognos
MicroStrategy
Is a MOLAP (multidimensional online analytical processing) server, typically used as a BI tool for purposes such as controlling and budgeting. It is a product of Jedox AG. It uses spreadsheet software as its user interface and allows different users to share a centralized database that acts as a single source of truth.
MicroStrategy
Palo OLAP Server
IBM Cognos Analysis Studio
Is a multidimensional open-source analytics engine. It is designed to provide a SQL interface and MOLAP on top of Hadoop to support large data sets. It was developed to reduce query processing time over billions of data rows.
Apache Kylin
Palo OLAP Server
MicroStrategy
Switzerland-based company _____ owns a business intelligence software of the same name. It sells an online analytical processing server that is implemented in Java as per J2EE standards. It is an in-memory OLAP server and is compatible with any data source that holds its data in tabular form.
IcCube
Apache Kylin
MicroStrategy
Is a powerful open-source tool that provides key BI features like OLAP services, data integration, data mining, extract-transform-load (ETL), reporting and dashboard capabilities. ____ is built on the Java platform and can work with Windows, Linux and Mac operating systems.
Pentaho BI
MicroStrategy
Apache Kylin
Is a very interactive tool with outstanding features and strengths like its ability to work with categorical data, large data as well as geographical data. It is a general purpose data visualization tool. It consists of interlinked plots and queries.
Mondrian
MicroStrategy
Apache Kylin
A unique platform, _____ enables its customers to gain deeper insights into their data and helps them make faster, better-informed decisions. It offers visual analytics via highly interactive dashboards. It is capable of providing metadata search, timely alerts and powerful operational reporting.
JsHypercube
OBIEE (Oracle Business Intelligence Enterprise Edition)
Is an OLAP database server written in the JavaScript programming language. It is a lightweight database.
OBIEE (Oracle Business Intelligence Enterprise Edition)
JsHypercube
Apache Kylin
The _____ algorithm is aimed at reconstructing causality from a set of sequences of events.
Teiresias
Alpha
The _____ algorithm enables the discovery of rigid patterns (motifs) in biological sequences.
Alpha
Teiresias
The α-algorithm is an algorithm used in data mining, aimed at reconstructing causality from a set of sequences of events.
Alpha
Teiresias
This is designed to operate on databases containing transactions, for example collections of items bought by customers or details of website visits. It uses a bottom-up approach, where frequent subsets are extended one item at a time (a step called candidate generation) and groups of candidates are tested against the data.
Alpha
Teiresias
Apriori Algorithm
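The bottom-up idea described above (generate candidates one item larger, then test them against the data) can be sketched as follows. This is a minimal teaching version, not an optimized implementation; the function name `apriori`, the `min_support` parameter (an absolute count) and the shopping baskets are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Test candidates against the data and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Made-up transactions for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
for itemset, support in result.items():
    print(sorted(itemset), support)
```

With these baskets and a minimum support of 3, every single item is frequent but only the pair bread and milk survives to the second level.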
This is used for sequence mining. The approach is based on the Apriori (level-wise) algorithm. One way of using the level-wise paradigm is to first discover the frequent items.
Alpha
Teiresias
GSP (Generalized Sequential Pattern) Algorithm
A combinatorial algorithm for the discovery of rigid patterns (motifs) in biological sequences. It uses regular expressions to define the patterns, which allows the reported patterns to consist not only of the characters that appear in each position (literals)
Teiresias
Alpha
GSP (Generalized Sequential Pattern) Algorithm
b follows a, but a never follows b, so b depends on a. Written as a -> b
Temporal independence
Temporal dependency
Alpha algorithm
Teiresias
There is a trace where a follows b but also a trace where b follows a. Activities a and b can be executed in parallel. Can be written as a || b
Independence
Temporal dependency
Temporal independence
There is no trace where a follows b or b follows a. Activities a and b are independent. Can be written as a # b
Temporal independence
Temporal dependency
Independence
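The three relations above can be derived mechanically from an event log. The sketch below assumes that "follows" means "directly follows in some trace", which is how the α-algorithm is usually presented; the function name `footprint` and the sample log are made up for illustration.

```python
def footprint(traces):
    """Classify each ordered pair of activities by direct succession."""
    activities = sorted({a for trace in traces for a in trace})
    # (x, y) is in `follows` when y directly follows x in some trace.
    follows = {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}
    relations = {}
    for x in activities:
        for y in activities:
            xy, yx = (x, y) in follows, (y, x) in follows
            if xy and not yx:
                relations[(x, y)] = "->"  # temporal dependency: y depends on x
            elif yx and not xy:
                relations[(x, y)] = "<-"  # reverse dependency
            elif xy and yx:
                relations[(x, y)] = "||"  # temporal independence (parallel)
            else:
                relations[(x, y)] = "#"   # independence
    return relations


# Hypothetical event log: each tuple is one trace.
log = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]
rel = footprint(log)
print(rel[("a", "b")], rel[("b", "c")], rel[("b", "e")])
```

In this log, b always comes after a (a -> b), b and c appear in both orders (b || c), and b and e never follow one another (b # e).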
Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the _____ method.
DataFrame.agg()
groupby()
The _____ method is applied on the Sex column to make a group per category. The average age for each gender is then calculated and returned.
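A small sketch of choosing a different set of statistics per column with `DataFrame.agg()`. The four-row table is a made-up stand-in for the Titanic data, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the Titanic table (values are illustrative only).
titanic = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Pick a different combination of statistics for each column.
summary = titanic.agg({
    "Age": ["min", "max", "median"],
    "Fare": ["min", "max", "mean"],
})
print(summary)
```

The result has one row per requested statistic; a cell is NaN where that statistic was not requested for the column (for example, the median of Fare here).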
DataFrame.agg()
groupby()
Split-apply-combine pattern:
Split the data into groups; apply a function to each group independently; combine the results into a data structure.
>>titanic.groupby("Sex")["Age"].mean()
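To make the three steps visible, the sketch below first spells out split, apply and combine by hand and then repeats them as a single `groupby` expression. The small table is a made-up stand-in for the Titanic data.

```python
import pandas as pd

# Made-up stand-in for the Titanic table (values are illustrative only).
titanic = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Split: one sub-table per category. Apply: mean of Age per sub-table.
pieces = {sex: group["Age"].mean() for sex, group in titanic.groupby("Sex")}
# Combine: collect the per-group results into one Series.
manual = pd.Series(pieces, name="Age")

# The same three steps in one expression.
grouped = titanic.groupby("Sex")["Age"].mean()

print(manual)
print(grouped)
```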
Average age of the Titanic passengers?
>>titanic["Age"].mean()
>>titanic[["Age", "Fare"]].median()
Median age and ticket fare price of the Titanic passengers?
>>titanic["Age"].mean()
>>titanic[["Age", "Fare"]].median()
The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function
>>titanic[["Age", "Fare"]].describe()
>>titanic["Age"].mean()
The average age for male versus female Titanic passengers
>>titanic[["Sex", "Age"]].groupby("Sex").mean()
>>titanic[["Age", "Fare"]].describe()
The apply and combine steps are typically done together in pandas. In the previous example, we explicitly selected the 2 columns first. If not, the mean method is applied to each column containing numerical values:
>>titanic.groupby("Sex").mean(numeric_only=True)
>>titanic.groupby("Sex")["Age"].mean()
It does not make much sense to get the average value of Pclass. If we are only interested in the average age for each gender, the selection of columns (square brackets [] as usual) is supported on the grouped data as well. So we do the following:
>>titanic.groupby("Sex")["Age"].mean()
>>titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
What is the mean ticket fare price for each of the sex and cabin class combinations?
>>titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
>>titanic.groupby("Sex")["Age"].mean()
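Grouping by several columns at once works the same way as grouping by one. The sketch below uses a made-up five-row stand-in for the Titanic table; it also passes `numeric_only=True` when averaging every column, since recent pandas versions refuse to average text columns such as the hypothetical `Name` column here.

```python
import pandas as pd

# Made-up stand-in for the Titanic table (values are illustrative only).
titanic = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Name": ["A", "B", "C", "D", "E"],
})

# Mean fare for each (Sex, Pclass) combination; the result has a MultiIndex.
fare_by_group = titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

# Mean of every numeric column per Sex; the text column is skipped.
numeric_means = titanic.groupby("Sex").mean(numeric_only=True)

print(fare_by_group)
print(numeric_means)
```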
What is the number of passengers in each of the cabin classes?
>>titanic["Pclass"].value_counts()
>>titanic.groupby("Pclass")["Pclass"].count()
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:
>>titanic["Pclass"].value_counts()
>>titanic.groupby("Pclass")["Pclass"].count()
Both ____ and ___ can be used in combination with groupby. Whereas size includes NaN values and just provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use the dropna argument to include or exclude the NaN values.
Size and quantity
Size and count
Number and size
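The difference between the two methods only shows up when values are missing. The sketch below uses a made-up table with two missing ages to compare them, along with the `dropna` argument of `value_counts`.

```python
import pandas as pd

# Made-up data with two missing ages (values are illustrative only).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, None, 22.0, None, 26.0],
})

# size: number of rows per group, missing values included.
rows_per_class = titanic.groupby("Pclass")["Age"].size()
# count: number of non-missing values per group.
ages_per_class = titanic.groupby("Pclass")["Age"].count()

# value_counts drops NaN by default; dropna=False keeps it as its own entry.
without_nan = titanic["Age"].value_counts()
with_nan = titanic["Age"].value_counts(dropna=False)

print(rows_per_class)
print(ages_per_class)
```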
Facilitate the online analytical processing of multidimensional data.
Database
Data cubes
Data Warehouse
A data cube is
Group of cuboids
Lattice of cuboids
Member of cuboids
The _____ cuboid is the least generalized of all the cuboids in the data cube.
Upper
Base
Lower
The most generalized cuboid is the ___, commonly represented as all
Latest cuboid
Apex cuboid
Top cuboid
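For a cube without concept hierarchies, the lattice of cuboids is simply every subset of the dimensions. The sketch below enumerates it; the helper name `cuboids` and the three dimension names are assumptions chosen for illustration.

```python
from itertools import combinations


def cuboids(dimensions):
    """List every cuboid (group-by) of a cube that has no concept hierarchies."""
    lattice = []
    for k in range(len(dimensions) + 1):
        # All k-dimensional cuboids: choose which k dimensions are kept.
        lattice.extend(combinations(dimensions, k))
    return lattice


dims = ("month", "city", "customer_group")
lattice = cuboids(dims)

apex = lattice[0]    # (): the 0-D cuboid, everything aggregated ("all")
base = lattice[-1]   # all n dimensions kept: the least generalized cuboid

print(len(lattice), apex, base)
```

With n dimensions there are 2^n cuboids under this assumption, so three dimensions give eight.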
A cell in the base cuboid is a
Cellular
Base cell
Standard cell
An ______ aggregates over one or more dimensions, where each aggregated dimension is indicated by a ∗ in the cell notation.
Aggregate cell
Base cell
We say that a is ________ (i.e., from an m-dimensional cuboid) if exactly m (m ≤ n) values among {a_1, a_2, ..., a_n} are not ∗. If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e., where m < n).
N-dimensional cell
M-dimensional cell
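The definition above amounts to counting the dimension values that are not ∗. A minimal sketch, assuming a cell is stored as a tuple whose last element is the measure and that an ASCII `"*"` stands in for ∗; the helper names are hypothetical.

```python
STAR = "*"  # ASCII stand-in for the aggregation marker


def cell_dimensionality(cell):
    """Return m, the number of non-* dimension values (measure excluded)."""
    *dims, _measure = cell
    return sum(1 for value in dims if value != STAR)


def is_base_cell(cell):
    """A cell is a base cell when m equals n, the number of dimensions."""
    *dims, _measure = cell
    return cell_dimensionality(cell) == len(dims)


# Example cells over (month, city, customer group) with a sales measure.
cells = [
    ("Jan", "*", "*", 2800),
    ("Jan", "*", "Business", 150),
    ("Jan", "Chicago", "Business", 45),
]
for cell in cells:
    kind = "base" if is_base_cell(cell) else "aggregate"
    print(cell, cell_dimensionality(cell), kind)
```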
Is the task of grouping objects so that objects in the same group are more similar to each other, in some sense, than to objects in other groups. It is used in fields such as pattern recognition, image analysis, bioinformatics, information retrieval and computer graphics.
Cluster analysis
Database Analysis
Classifying
_____ is the process of partitioning a set of data objects (or observations) into subsets, where each subset is a ____
Clustering, Cluster
Grouping, group
Defining, Definition
Hierarchical clustering builds models based on distance connectivity.
Density models:
Connectivity models:
Centroid model:
Distribution model:
For example, the k-means algorithm represents each cluster by a single mean vector.
Centroid model:
Density models:
Distribution model:
Connectivity models:
Clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
Centroid model:
Distribution model:
Density models:
In biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.
Subspace models:
Group models:
Some algorithms do not provide a refined model of their results and instead give only the grouping information.
Group models:
Graph-based models:
A subset of nodes in a graph such that every two nodes in the subset are connected by an edge (a clique).
Graph-based models:
Group models:
Also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster.
Fuzzy Clustering:
K-Means Clustering:
A method of vector quantization. This algorithm has a loose relationship to the k-nearest neighbor classifier.
Fuzzy Clustering:
K-Means Clustering:
A variation of k-means. It is a cluster analysis algorithm that has the effect of minimizing error over all clusters with respect to the 1-norm distance metric.
K-Means Clustering:
Fuzzy Clustering:
K-Median Clustering:
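The difference between k-means and k-median lies in how each cluster center is recomputed. The sketch below runs the same assign-then-update loop on made-up one-dimensional data with either the mean or the median as the update rule; `cluster_1d` and the starting centers are assumptions for illustration. In one dimension the assignment distance is the same for both variants, so only the update step differs.

```python
from statistics import mean, median


def cluster_1d(points, centers, center_fn, iterations=10):
    """Lloyd-style loop: assign to the nearest center, then recompute centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [center_fn(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Made-up data with one outlier (30.0).
data = [1.0, 2.0, 3.0, 10.0, 11.0, 30.0]
k_means = cluster_1d(data, centers=[0.0, 12.0], center_fn=mean)
k_medians = cluster_1d(data, centers=[0.0, 12.0], center_fn=median)

print(k_means)
print(k_medians)
```

The outlier pulls the mean-based center of the second cluster well away from 10 and 11, while the median-based center stays on an actual data point.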
Another method for hierarchical clustering. It is based on grouping clusters in a bottom-up fashion (agglomerative clustering): each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.
Single Linkage Clustering:
K-Median Clustering:
K-Means Clustering:
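A minimal sketch of the bottom-up merging described above, using the closest pair of points as the distance between two clusters. It works on made-up one-dimensional data and stops at a requested number of clusters; the function name `single_linkage` and the `num_clusters` parameter are assumptions for this example.

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering of 1-D points with single-linkage distance."""
    clusters = [[p] for p in points]  # each observation starts in its own cluster

    def linkage(c1, c2):
        # Distance between two clusters = distance of their closest pair.
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


groups = single_linkage([1.0, 2.0, 4.0, 10.0, 11.0, 20.0], num_clusters=3)
print(groups)
```

This brute-force version recomputes every pairwise distance at each step, which is fine for a handful of points but not for large data sets.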
Group of methods that are effective in high dimensional data applications.
K-Median Clustering:
K-Means Clustering:
Spectral Clustering:
Fuzzy Clustering:
A method of cluster analysis which seeks to build a hierarchy of clusters. There are 2 types (Bolton, 2017):
Hierarchical Clustering:
Spectral Clustering:
K-Median Clustering:
K-Means Clustering:
A bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Agglomerative:
Divisive:
The opposite of the agglomerative approach. In this top-down approach, all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Agglomerative:
Divisive:
Using statistical techniques from data mining, data modelling and machine learning, the available data are analyzed to provide information about something unknown. With the data at hand and these techniques, a business owner can estimate what is likely to happen.
Predictive Analytics:
Decision Support System:
An application that uses massive amounts of data to support determinations, judgments and courses of action by providing information for decision making. The information can be used by the computer itself or by whoever in the organization has the authority to act on it.
Decision Support System:
Web Mining:
Refers to the application of data mining techniques to discover patterns, structures, and knowledge from the Web. According to analysis targets, web mining can be organized into three main areas: web content mining, web structure mining, and web usage mining.
Web Mining:
Decision Support System:
Let a = (a_1, a_2, ..., a_n, measures)
N dimensional data cube
M dimensional data cube
(Jan, ∗ , ∗ , 2800) and (∗, Chicago, ∗ , 1200) are 1-D cells; (Jan, ∗ , Business, 150) is a 2-D cell; and (Jan, Chicago, Business, 45) is a 3-D cell. Here, all base cells are 3-D, whereas 1-D and 2-D cells are aggregate cells.
Base and aggregate cells. Consider a data cube with the dimensions month, city, and customer group, and the measure sales.
That's the answer