Governance: The Who, What, When, Why, Where and How of Access Data

Data can be one of the most powerful tools to improve customer experiences and increase customer acquisition and retention. Nevertheless, many businesses are feeling the pressure of data overload. Documenting business objectives helps determine what data should be captured, how the data is related, and how it should be structured to transform your data into useful information.

Master Data Management is a comprehensive platform that delivers consolidated, consistent and authoritative master data across the enterprise and distributes this master information to all operational and analytical applications. Its capabilities are designed for mastering data across multiple domains, including Customer, Supplier, Site, Account, Asset and Product, among many others.

The Data Management Association (DAMA) has identified 10 major functions of Data Management in the DAMA-DMBOK (Data Management Body of Knowledge).

Data Governance is identified as the core component of Data Management, tying together the other nine disciplines: Data Architecture, Data Quality, Reference & Master Data, Data Security, Database Operations, Data Development, Metadata, Document & Content Management, and Data Warehousing & BI.


Data Management Association (DAMA) Data Governance Framework

Effective data governance serves an important function within the enterprise, setting the parameters for data management and usage, creating processes for resolving data issues and enabling business users to make decisions based on high-quality data and well-managed information assets. But implementing a data governance framework isn’t easy. Complicating factors often come into play, such as data ownership questions, data inconsistencies across different departments and the expanding collection and use of big data in companies.

The essential WHO-WHAT-WHEN-WHERE-WHY-HOW information about Data Governance


WHAT does Data Governance mean, and what does it do?

Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.

DAMA defines Data Governance as: “The exercise of authority, control and shared decision-making (planning, monitoring and enforcement) over the management of data assets. Data Governance is high-level planning and control over data management.”

According to the Data Governance Institute, “Data Governance is a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.”

WHO is involved with Data Governance?

Data Governance is of concern to any individual or group who has an interest in how data is created, collected, processed and manipulated, stored, made available for use, or retired. We call such people Data Stakeholders.

WHEN do organizations need formal Data Governance?

Organizations need to move from informal governance to formal Data Governance when one of four situations occurs:

  • The organization gets so large that traditional management isn’t able to address data-related cross-functional activities.
  • The organization’s data systems get so complicated that traditional management isn’t able to address data-related cross-functional activities.
  • The organization’s Data Architects, SOA teams, or other horizontally-focused groups need the support of a cross-functional program that takes an enterprise (rather than siloed) view of data concerns and choices.
  • Regulation, compliance, or contractual requirements call for formal Data Governance.

WHERE in an organization are Data Governance Programs located?

This varies. They can be placed within Business Operations, IT, Compliance/Privacy, or Data Management organizational structures. What’s important is that they receive appropriate levels of leadership support and appropriate levels of involvement from Data Stakeholder groups.

WHY use a formal Data Governance Framework?

Frameworks help us organize how we think and communicate about complicated or ambiguous concepts. The use of a formal framework can help Data Stakeholders from Business, IT, Data Management, Compliance, and other disciplines come together to achieve clarity of thought and purpose.

HOW do we assess whether we are ready for Data Governance?

A Data Governance Maturity Assessment allows an organization to measure its current state, determine both interim and long-term goals for improvement, identify the best practices that will move it to the next stage, and assess its progress at any point in the process. Because every organisation differs in its business, systems, management style and so forth, performing the Data Governance Maturity Assessment will help in designing both the short- and long-term goals for a Data Governance program that is tailored to the organisation. To get the most out of your data warehouse and business intelligence implementation, a Data Governance Maturity Assessment should be performed.

HOW does an organization “do” Data Governance?

Data Governance programs tend to start by focusing their attention on finite issues, then expanding their scope to address additional concerns or additional sets of information. Establishing Data Governance therefore tends to be an iterative process; a new area of focus may go through all of the steps described above at the same time that other governance-led efforts are well established in the “govern the data” phase.

In other words, a data governance framework assigns ownership and responsibility for data, defines the processes for managing data, and leverages technologies that will help enable the aforementioned people and processes.

The objectives of data governance are to:

  1. Enable better decision-making
  2. Reduce operational friction
  3. Protect the needs of data stakeholders
  4. Train management and staff to adopt common approaches to data issues
  5. Build standard, repeatable processes
  6. Reduce costs and increase effectiveness through coordination of efforts
  7. Ensure transparency of processes
  8. Ensure a single version of the truth for your organization


Data Governance Focus Areas

Data governance touches various components of enterprise information management, and a program will have a different set of objectives and a different implementation approach depending on which of these specific areas it focuses on.

Data Governance programs with different focus areas will, however, differ in the type of rules and issues they address. They will differ in the emphasis they give to certain data-related decisions and actions. And they will differ in the level of involvement required of different types of data stakeholders.

Data Governance with Focus on Data Quality

The most common objective of Data Governance programs is to standardize data definitions across an enterprise. Quality needs to be a mandatory piece of a larger governance strategy. Without it, your organization is not going to successfully manage and govern its most strategic asset: its data. Any good active data governance methodology should let you measure your data quality. This is important because data quality actually has multiple dimensions which need to be managed.

Data governance initiatives improve data quality by assigning a team responsible for data’s accuracy, accessibility, consistency, and completeness, among other metrics. This team usually consists of executive leadership, project management, line-of-business managers, and data stewards. The team usually employs some form of methodology for tracking and improving enterprise data, such as Six Sigma, and tools for data mapping, profiling, cleansing, and monitoring data. At the end of the day, data quality and data governance are not synonymous, but they are closely related.
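As a toy illustration of those quality dimensions, the sketch below scores a small set of records for completeness and validity. The record layout, field names and validation rule are assumptions invented for the example, not taken from any specific governance tool.

```python
import re

# Invented sample records; in practice these would come from an enterprise system
records = [
    {"name": "Ann Murphy",   "email": "ann@example.com", "county": "Dublin"},
    {"name": "Brian O'Neil", "email": "",                "county": "Cork"},
    {"name": "",             "email": "c@example.com",   "county": "Galway"},
]

def completeness(rows, field):
    """Share of rows where the field is non-empty."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def validity(rows, field, pattern):
    """Share of non-empty values that satisfy a validation rule."""
    values = [r[field] for r in rows if r.get(field)]
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

print(round(completeness(records, "email"), 2))  # 2 of 3 emails present -> 0.67
print(validity(records, "email", r"[^@]+@[^@]+\.[^@]+"))  # both present emails valid -> 1.0
```

A data steward would track metrics like these over time and set thresholds that trigger remediation.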

Data Governance with Focus on Privacy / Compliance / Security

The digital era has created unprecedented opportunities to conduct business and deliver services over the Internet. Nevertheless, as organizations collect, store, process and exchange large volumes of information in the course of addressing these opportunities, they face increasing challenges in the areas of data security, maintaining data privacy and meeting related compliance obligations.

Big data privacy falls under the broad spectrum of IT governance and is a critical component of your IT strategy. You need a level of confidence in how any data is handled to make sure your organization isn’t at risk of a nasty, often public data exposure. That extends to privacy for all your data, including big data sets that are increasingly becoming part of the mainstream IT environment.

Cyber Security – Companies of all sizes need to:

  • Understand who can access which types of data, via what means, and within what parameters (time of day, department, location, and many more)
  • Determine what data is sensitive
  • Review and authorize access
  • Monitor who is actually accessing the data
  • Detect unauthorized access in real-time
  • Track data access patterns
  • Be able to perform forensics after the fact
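A toy sketch of the monitoring and detection bullets above: access events are checked against a simple role-based policy and unauthorized reads are flagged. The policy, roles and event fields are invented for illustration; a real deployment would read from audit logs.

```python
# Invented role-based policy: role -> data classifications it may read
POLICY = {
    "analyst": {"public", "internal"},
    "dba":     {"public", "internal", "sensitive"},
}

# Sample access events (in practice these would come from audit logs)
events = [
    {"user": "alice", "role": "analyst", "data_class": "internal"},
    {"user": "bob",   "role": "analyst", "data_class": "sensitive"},
]

def unauthorized(events, policy):
    """Flag events whose data classification is outside the role's allowed set."""
    return [e for e in events if e["data_class"] not in policy.get(e["role"], set())]

alerts = unauthorized(events, POLICY)
print([e["user"] for e in alerts])  # ['bob']
```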

Data Governance with a Focus on Data Policies, Standards and Strategies

A focus on data policies, data standards, and overall data strategy is usually the first step when an organization initiates a data governance function.

Data Governance with a Focus on Data Warehouses and Business Intelligence (BI)

This type of program typically comes into existence in conjunction with a specific data warehouse, data mart, or BI tool. These types of efforts require tough data-related decisions, and organizations often implement governance to help make initial decisions, to support follow-on decisions, and to enforce standards and rules after the new system becomes operational.

Data Governance with a Focus on Architecture / Integration

This type of program typically comes into existence in conjunction with a major system acquisition, development effort, or update that requires new levels of cross-functional decision-making and accountabilities.

Data Governance with a Focus on Management Support

Data Governance programs with a focus on Management Support typically come into existence when managers find it difficult to make “routine” data-related management decisions because of their potential effect on operations or compliance efforts.

It is important to recognise that data governance is not an IT function. Accountants can play a key role in enabling Data Governance and in ensuring that it is aligned with an organization’s overall corporate governance processes. Accountants are already familiar with applying many of the principles above to the financial data that they work with on a regular basis.

Becoming involved in a data management or data governance initiative provides the opportunity to apply these principles in other parts of the organization. Developing a successful data governance strategy requires careful planning, the right people and appropriate tools and technologies. IT is a member of the data governance board, but any effective data governance program requires executive sponsorship and business involvement.




The Data Scientist: Finding the right tools for the job



Which technologies and tools look best on a resume? Do you need to know HBase, Cassandra, MySQL, Excel, SPSS, R or SAS? Professionals advise: all of the above.

Data analysts have many tools at their disposal, from linear regression to classification trees to random forests, and these tools have all been carefully implemented on computers. But ultimately, it takes a data analyst—a person—to find a way to assemble all of the tools and apply them to data to answer a question of interest to people.

While working as the chief data scientist at Facebook, Jeff Hammerbacher described how, on any given day, his team would use Python, R and Hadoop, and then have to relay the analyses to colleagues. Additionally, a recent SiSense study of data professionals found that 60 percent of respondents use three or more data warehouse and business intelligence interfaces.

Data volumes are growing rapidly, and at the same time we have many tools to deal with that data. We can categorize the software based on the tasks and the data it can handle, classifying the tools available in the market as Reports and Dashboard development Tools, Statistical Packages and BI Tools.

Tools for Data Analysis – Reports and Dashboard development Tools

Report generation and dashboard development are daily tasks for any organization. Organizations want to understand their data on a timely basis, producing daily, weekly, monthly, yearly and ad-hoc reports and dashboards. There are many tools available in the market, such as MS Excel, Tableau, QlikView and Spotfire. These tools are not only for developing reports and dashboards; they are also adding capabilities to perform statistical analysis. Excel is a spreadsheet application, while Tableau, QlikView and Spotfire are BI tools with powerful engines to perform ETL.

Tools for Data Analysis – Statistical Packages

SAS and SPSS are the leaders in advanced statistical modelling, and there are also open source alternatives such as R. These packages offer many built-in procedures and tools for extracting, cleaning, formatting and tabulating data, and for statistical analysis. SAS and SPSS can also be considered BI tools, as they have very powerful modules to perform ETL. ETL (Extract, Transform and Load) is the process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
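To make the ETL definition concrete, here is a minimal, self-contained sketch: it extracts rows from a CSV source, transforms them (type casting and aggregation), and loads the result into an in-memory SQLite table standing in for the warehouse. The data and schema are invented for the example.

```python
import csv, io, sqlite3

# Extract: read rows from a CSV "source system" (a string stands in for a file)
source = io.StringIO("customer,amount\nann,10.5\nbrian,7.25\nann,3.0\n")
rows = list(csv.DictReader(source))

# Transform: cast amounts to numbers and aggregate sales per customer
totals = {}
for r in rows:
    totals[r["customer"]] = totals.get(r["customer"], 0.0) + float(r["amount"])

# Load: insert the aggregated rows into a warehouse fact table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (customer TEXT PRIMARY KEY, total REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?)", totals.items())
print(db.execute("SELECT customer, total FROM sales_fact ORDER BY customer").fetchall())
# [('ann', 13.5), ('brian', 7.25)]
```

Real ETL tools add scheduling, error handling and incremental loads on top of these same three steps.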

Tools for Data Analysis – BI Tools

MS SQL Server, Oracle, SAP, MicroStrategy, Spotfire, QlikView, SAS and SPSS Modeler are the leaders in this market. They can deal with large amounts of data and perform ETL and drill-down analysis. We will see more about BI concepts in a separate topic.

Thankfully, there are ample resources on the Web to develop and hone your skills. Big Data University, for example, offers free resources to help data professionals gain proficiency in JAQL, MapReduce, Hive, Pig and others.

It’s also important to gain experience using these skills in the “real world.” Gopinathan advises aspiring data scientists to participate heavily in open-source projects and data contests, such as Kaggle, to practice utilizing technical, scientific and visual skills in real business scenarios.

In summary, to make the transition from BI specialist to data scientist is going to require the following new skills and capabilities:

  • Deep dive into the multitude of statistical and predictive analytics models. Without a doubt, you’re going to have to get out your college statistics and advanced statistics books and spend time learning how and when to apply the right analytic models given the business situation.
  • Learning new analytic tools like R, SAS and MADlib. R, for example, is an open source product for which lots of tools (like RStudio) and much training is available free and on-line.
  • Learning more about Hadoop and related Hadoop products like HBase, Hive and Pig. There is no doubt that Hadoop is here to stay, and there will be a multitude of opportunities to use Hadoop in the data preparation stage.  It’s the perfect environment for adding structure to unstructured data, performing advanced data transformations and enrichments, and profiling and cleansing your data coming from a multitude of data sources.

As you make your way through the world of data science, learning R programming and other important skills, it’s important to remember that data science isn’t just a collection of tools.

It requires a person to apply those tools in a smart way to produce results that are useful to people. Choosing the right modelling approach is often a creative exercise that demands expert human judgment.

No matter what type of company you’re interviewing for, you’re likely going to be expected to know how to use the tools of the trade. This means a statistical programming language, like R or Python, and a database querying language like SQL.

Vegetarian Pizza’s Low Popularity Rating: The Scientific Evidence Behind the “Superfood” Menu

The idea behind this R project came from a “pizza night” that we had one day… Many thanks to Darren. After five meat pizzas were eaten and half of one vegetarian pizza was left, the idea to analyse the pattern was just what I needed. Let’s look at superfood from a data scientist’s perspective.

The origin of the “superfood”
The concept of the “superfood” is a popular one when it comes to food and health. The media is full of reports of ultra-healthy foods, from blueberries and beetroot to cocoa and salmon. These reports claim to reflect the latest scientific evidence, and assure us that eating these foods will give our bodies the health kick they need to stave off illness and aging. But is there any truth to such reports?

Despite its ubiquity in the media, however, there is no official or legal definition of a superfood. The Oxford English Dictionary, for example, describes a superfood as “a nutrient-rich food considered to be especially beneficial for health and well-being”, while the Merriam-Webster dictionary omits any reference to health and defines it as “a super nutrient-dense food, loaded with vitamins, minerals, fibre, antioxidants, and/or phytonutrients”.

Criticism of the nomenclature

“There’s no such thing as a superfood. It’s nonsense: just one of those marketing terms,” says University College Dublin professor of nutrition Mike Gibney, throwing on the garb of Ireland’s superfood Grinch. “There is no evidence that any of these foods are in any way unusually good.”

“The European Food Safety Authority was created because the consumer was being conned by marketing people,” says Gibney. The authority bans health claims lacking scientific evidence, so you might find amazing health claims about superfoods in books and on websites, but you won’t on supermarket shelves.

What is the evidence? 
In order to distinguish the truth from the hype, it is important to look carefully at the scientific evidence behind the media’s superfood claims. So what data should we use for analysis? What dimensions can be considered scientific?

Food supplements

The idea behind food supplements, also called dietary or nutritional supplements, is to deliver nutrients that may not be consumed in sufficient quantities. Food supplements can be vitamins, minerals, amino acids, fatty acids, and other substances delivered in the form of pills, tablets, capsules, liquid, etc. Supplements are available in a range of doses and in different combinations. However, only a certain amount of each nutrient is needed for our bodies to function, and higher amounts are not necessarily better. At high doses, some substances may have adverse effects and may become harmful. To safeguard consumers’ health, supplements can therefore only be legally sold with an appropriate daily dose recommendation and a warning statement not to exceed that dose.

There is a lot of legislation concerning food supplements in Europe and America. Let’s start from vitamins and minerals, as the Food Safety Authority of Ireland has issued Guidance Note No. 21, “Food Supplements Regulations and Notifications”.


Taking microelements as a reference point, I created a table from the data source “Categories for Food Nutrition Labels” and ranked foods by nutrient density.

| Food (% DV per 100 g) | Calcium | Iron | Magnesium | Phosphorus | Sodium | Potassium | Zinc | Copper | Manganese | Selenium |
| Cola, carbonated beverage, without caffeine | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 0% | 0% |
| Apples, raw, with skin | 1% | 1% | 1% | 1% | 0% | 3% | 0% | 1% | 2% | 0% |
| Beer, regular (BUDWEISER) | 0% | 0% | 2% | 1% | 0% | 1% | 0% | 0% | 0% | 0% |
| Tea, black, brewed with tap water | 0% | 0% | 1% | 0% | 0% | 1% | 0% | 1% | 11% | 0% |
| Lamb, domestic, shoulder, whole, lean, 1/4 in. fat, choice, raw | 2% | 9% | 6% | 18% | 3% | 8% | 32% | 5% | 1% | 32% |
| Beef, bottom sirloin roast, lean and fat, trimmed to 0 in. fat, raw | 2% | 8% | 5% | 19% | 2% | 9% | 24% | 4% | 1% | 34% |
| Fish, salmon, Atlantic, farmed, raw | 1% | 2% | 7% | 24% | 2% | 10% | 2% | 2% | 1% | 34% |
| Chicken, broilers or fryers, leg, meat and skin, raw | 1% | 4% | 5% | 16% | 4% | 6% | 10% | 3% | 1% | 26% |
| Chicken, breast, skinless, boneless, meat only, raw | 1% | 2% | 7% | 21% | 2% | 10% | 5% | 2% | 1% | 33% |
| Bread, pumpernickel | 7% | 16% | 14% | 18% | 25% | 6% | 10% | 14% | 65% | 35% |
| Bread, wheat | 13% | 19% | 11% | 15% | 21% | 5% | 8% | 8% | 59% | 41% |
| Bread, Italian | 8% | 16% | 7% | 10% | 26% | 3% | 6% | 10% | 23% | 39% |
| Bread, white, commercially prepared (incl. soft bread crumbs) | 14% | 20% | 6% | 10% | 20% | 4% | 5% | 5% | 27% | 31% |
| Wheat flour, whole-grain | 3% | 20% | 34% | 36% | 0% | 10% | 17% | 21% | 203% | 88% |
| Wheat, soft white | 3% | 30% | 23% | 40% | 0% | 12% | 23% | 21% | 170% | 0% |
| Wheat bran, crude | 7% | 59% | 153% | 101% | 0% | 34% | 48% | 50% | 575% | 111% |
| Wheat germ, crude | 4% | 35% | 60% | 84% | 1% | 25% | 82% | 40% | 665% | 113% |
| Spices, curry powder | 53% | 106% | 64% | 37% | 2% | 33% | 31% | 60% | 415% | 58% |
| Egg, whole, raw, fresh | 6% | 10% | 3% | 20% | 6% | 4% | 9% | 4% | 1% | 44% |
| Lettuce, iceberg (incl. crisphead types), raw | 2% | 2% | 2% | 2% | 0% | 4% | 1% | 1% | 6% | 0% |
| Oranges, raw, all commercial varieties | 4% | 1% | 3% | 1% | 0% | 5% | 0% | 2% | 1% | 1% |
| Pineapple, raw, all varieties | 1% | 2% | 3% | 1% | 0% | 3% | 1% | 6% | 46% | 0% |
| Bananas, raw | 1% | 1% | 7% | 2% | 0% | 10% | 1% | 4% | 14% | 1% |
| Potatoes, flesh and skin, raw | 1% | 4% | 6% | 6% | 0% | 12% | 2% | 5% | 8% | 0% |
| Rice, white, short-grain, raw | 0% | 24% | 6% | 10% | 0% | 2% | 7% | 11% | 52% | 22% |
| Peppers | 1% | 2% | 3% | 2% | 0% | 5% | 1% | 3% | 6% | 0% |
| Mushroom | 0% | 2% | 0% | 11% | 0% | 10% | 4% | 14% | 3% | 27% |
| Onion, raw | 2% | 1% | 1% | 3% | 0% | 4% | 1% | 2% | 6% | 1% |
| Broccoli, raw | 5% | 5% | 6% | 7% | 1% | 9% | 3% | 2% | 11% | 4% |
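The "rank foods by nutrient density" step can be sketched in a few lines of code. The snippet below uses a handful of rows transcribed from the table above (% DV per 100 g across the ten minerals) and ranks foods by their mean % DV; this is only one possible density score.

```python
# A few rows transcribed from the table: food -> % DV per 100 g for the ten
# minerals (calcium, iron, magnesium, phosphorus, sodium, potassium,
# zinc, copper, manganese, selenium)
foods = {
    "Wheat germ crude":     [4, 35, 60, 84, 1, 25, 82, 40, 665, 113],
    "Spices curry powder":  [53, 106, 64, 37, 2, 33, 31, 60, 415, 58],
    "Wheat bran crude":     [7, 59, 153, 101, 0, 34, 48, 50, 575, 111],
    "Apples raw with skin": [1, 1, 1, 1, 0, 3, 0, 1, 2, 0],
    "Broccoli raw":         [5, 5, 6, 7, 1, 9, 3, 2, 11, 4],
}

# Score each food by its mean % DV and sort, highest first
ranking = sorted(foods, key=lambda f: sum(foods[f]) / len(foods[f]), reverse=True)
print(ranking[:3])  # ['Wheat bran crude', 'Wheat germ crude', 'Spices curry powder']
```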

“R” paradigm:

To make the data more visually effective we used R. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R’s popularity has increased substantially in recent years.

R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. Many of R’s standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.

R’s data structures include vectors, matrices, arrays, data frames (similar to tables in a relational database) and lists. The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices (ggplot2), import/export capabilities, reporting tools (knitr, Sweave), etc. These packages are developed primarily in R, and sometimes in Java, C, C++ and Fortran. A core set of packages is included with the installation of R, with more than 5,800 additional packages (as of June 2014) available at the Comprehensive R Archive Network (CRAN), Bioconductor, Omegahat, GitHub and other repositories.

The “Task Views” page (subject list) on the CRAN website lists a wide range of tasks (in fields such as Finance, Genetics, High Performance Computing, Machine Learning, Medical Imaging, Social Sciences and Spatial Statistics) to which R has been applied and for which packages are available. R has also been identified by the FDA as suitable for interpreting data from clinical research.

R application:

I installed R version 3.2.1 on my computer. I had prepared my food list table in Excel, so before starting on my homework it was necessary to export the data to comma-separated values (CSV), an R-compatible format.

The code samples below assume that the data files are located in the R working directory, which can be found with the function getwd(). You can select a different working directory with the function setwd(), and thus avoid entering the full path of the data files. Note that the forward slash should be used as the path separator even on Windows.

> getwd()

> setwd("<new path>")

> setwd("C:/MyDoc")

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

To add package follow these steps:

Download and install a package (you only need to do this once).

To use the package, invoke the library(package) command to load it into the current session. (You need to do this once in each session, unless you customize your environment to automatically load it each time.)

On MS Windows:

  • Choose Install Packages from the Packages menu.
  • Select a CRAN Mirror. (e.g. Ireland)
  • Select a package. (e.g. gplots)
  • Then use the library(package) function to load it for use (e.g. library(gplots)).

To visualise my data I used this code in R:

data <- read.csv("microelement food list post.csv")

rnames <- data[,1]                    # first column: food names

mat_data <- data.matrix(data[,2:11])  # the ten nutrient columns as a numeric matrix

rownames(mat_data) <- rnames

my_palette <- colorRampPalette(c("white", "black"))(n = 650)

data_heatmap <- heatmap(mat_data, Colv = NA, Rowv = NA, col = my_palette, scale = "none")


I saved my file in JPEG format at 100% quality:



Then I played with different colours and palettes:

my_palette <- colorRampPalette(c("snow", "yellow", "orange", "brown", "black"))(n = 650)

data_heatmap <- heatmap(mat_data, Colv = NA, Rowv = NA, col = my_palette, scale = "none")


The heatmap in red was created with a ColorBrewer palette. The results are visualised differently from the previous ones, as this palette uses only 9 colours and the row order was rearranged automatically (row clustering was left enabled).

library(RColorBrewer)  # provides brewer.pal
heatmap(mat_data, Colv = NA, col = brewer.pal(9, "Reds"))

Why did vegetarian pizza lose points compared to the meat alternatives?

Generally speaking, superfoods are foods, especially fruits and vegetables, whose nutrient content confers a health benefit above that of other foods. Let’s see whether fruits and vegetables really have more nutritional benefit than meat, fish or spices.

As we can see from the table, vegetables have a much lower nutritional value than meat or grains.

From this table we can see the real “superfoods”: wheat germ, curry spice and wheat bran. However, different foods have different microelement densities: for example, of the meats in the comparison table, lamb is much higher in zinc.

The time for this project was limited. By adjusting the data and comparing like with like (raw with raw, oils with oils and fats with fats, etc.), we could find a superfood in each category.





“IN-Fusion Tables” – THE GOOGLE WAY


Google Fusion Tables (or simply Fusion Tables) is a web service provided by Google for data management. Fusion tables can be used for gathering, visualising and sharing data tables. Data are stored in multiple tables that Internet users can view and download. In the 2011 upgrade of Google Docs, Fusion Tables became a default feature, under the title “Tables (beta)”.

Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100 MB per data set, 250 MB of data per user. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. Users can keep the data private, share it with a select set of collaborators, or make it public and thus crawlable by search engines.

The discussion feature of Fusion Tables allows collaborators to conduct detailed discussions of the data at the level of tables and individual rows, columns, and cells. HTML is useful for styling info boxes and adding more complex features. Fusion Tables maps have limited options and functionality compared to custom mapping applications, but they are far easier to build. Fusion Tables does not require knowledge of JavaScript or CSS to make online maps.

Fusion Power of Visualisation

  • Upload and manage map data
  • Map points, lines or areas
  • Create pushpin, intensity, and other types of maps
  • Create other types of visualizations (charts)
  • Embed your visualizations in a Web site
  • Share and collaborate with others


A Science of Data-Visualization Storytelling

Data visualization is viewed by many disciplines as a modern equivalent of visual communication. A primary goal of data visualization is to communicate information clearly and efficiently to users via the statistical graphics, plots, information graphics, tables, and charts selected. Effective visualization helps users in analysing and reasoning about data and evidence. It makes complex data more accessible, understandable and usable.

Data visualization is both an art and a science. The rate at which data is generated has increased, driven by an increasingly information-based economy. Data created by internet activity and an expanding number of sensors in the environment, such as satellites and traffic cameras, are referred to as “Big Data”. Processing, analysing and communicating this data present a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address this challenge. Well-crafted data visualization helps uncover trends, realize insights, explore sources, and tell stories.


However, sometimes visualization tools require technical knowledge or are just too expensive. That’s why I thought about using Google Fusion Tables to provide a few complementary visualizations to Google Analytics – it is a great tool, very user friendly, and free. Google Fusion Tables provides means for visualizing data with pie charts, bar charts, line plots, scatterplots, timelines, and geographical maps. Google provides a quick step-by-step guide to using Fusion Tables to visualize Google Analytics data: how to bring in the data, prepare it, and visualize it using great charts.

THEmatic WEB MAPping

One of the coolest features of Fusion tables is their ability to interface with Google Maps. If a table contains geographical location data, it can be made into a layer for the Google Maps API, allowing you to visualize your data geographically. The display of information can be customized to make sure you’re getting the best visualization of your data.


One of the quickest and easiest ways to produce simple maps for your Web site is to use Google’s Fusion Tables. Fusion Tables is an online data management application designed for collaboration, visualization and publication of data. Journalists often want to create thematic web maps, in which geographic areas are filled in with colour/shade according to data values. Thanks to Google Fusion Tables, creating basic thematic maps and embedding them on a web page is now easy.

Web mapping is widely used by government statistical agencies. The Irish CSO has an option of web mapping from statistical data they collect and publish.


Accessed 02 Aug 2015


Ireland Mapping – HIStory or Visual Reality


For the “Google Fusion Tables” project assignment we created a map of Ireland showing county boundaries and the corresponding population density. The first step was to get the data into a Fusion Table-friendly format. Statistical data from the 2011 Census of Population was used. Some data cleaning was needed to match it with the Ireland county boundary data, which was downloaded in KML format.

Both tables, the geographic boundary information and the Census population data, were uploaded to a Google Drive account. A Fusion Table is created within a Google Docs account by clicking Create –> More –> Fusion Table. Afterwards the two tables were merged and the new map of Ireland was ready to view.

To customise the map, a few more steps were taken. By clicking “Configure styles” and then choosing “Fill colour” under “Polygons”, we were able to control the visible population density bands and the actual colours of the map. The final map with its legend tells an interesting story…
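The merge step above is essentially a table join on the county name. As a rough illustration, the same operation can be sketched in code; the geometry strings are placeholders and the population totals are approximate Census 2011 figures.

```python
# Stand-ins for the two uploaded tables: KML boundaries and Census population
boundaries = {
    "Dublin": "<kml>...</kml>",
    "Cork":   "<kml>...</kml>",
}
population = {"Dublin": 1273069, "Cork": 519032}

# Fusion Tables' "merge" is a join on the shared key (the county name)
merged = {
    county: {"boundary": boundaries[county], "population": population[county]}
    for county in boundaries.keys() & population.keys()
}
print(sorted(merged))  # ['Cork', 'Dublin']
```

Counties present in only one table would be dropped by this inner join, which is why the county names had to be cleaned to match first.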

map link

Storytelling with Data

The larger towns, Dublin, Cork and Galway, are service centres but, in addition, usually have industrial, administrative and commercial functions. The main concentration of towns is in the east and south of the country, and all of the larger centres grew up as ports. Dublin, the focus of the roads and railways, is situated where the central lowland reaches eastwards to the Irish Sea. It is the chief commercial, industrial, administrative, educational and cultural centre.

Cork city has traditionally been associated with the processing and marketing of agricultural products but it benefits also from the presence of large-scale industrial development around its outer harbour and the use of natural gas from the offshore Kinsale field.

On the west coast, the main city is Limerick, which is located at the lowest crossing place on the river Shannon. It shares in the prosperity of the Shannon Industrial Estate but its harbour facilities are now little used, though significant port and industrial activities are developing westwards along the Shannon estuary. The other significant western urban centre is Galway.

Contrasts and Consequences

Regional imbalances in population trends, employment, income and related social conditions have long been a feature of Ireland. The most striking traditional contrast is between the more prosperous east and the less developed west, though this twofold distinction is a simplification of a more complex regional pattern.

The less developed character of the west can be explained mainly in terms of its more difficult physical environment, its remoteness from external influences, markets and financial sources, its heavy dependence on small-farm agriculture and its lower levels of urbanisation and infrastructural provision. The result has been low incomes, high unemployment and underemployment and heavy migration from the area with its social consequences. In recent times inner Dublin and the central districts of other cities have been recognised as problem areas also.

Progress Facilitator: Incentives & Policies

Attempts have been made to counteract regional imbalance since the 1950s, at first focusing exclusively on the west but later promoting western development within a broader regional planning framework. The Irish-speaking Gaeltacht areas have been particularly favoured in welfare promotion. The major initial incentive was the allocation of direct state grants to manufacturing firms locating in the west, and although grant provision was later extended to all parts, a differential was maintained in favour of western areas.

The largest manufacturing concentration of this type is at Shannon, where an industrial estate was developed as part of a plan to promote traffic through the airport. While manufacturing remained the spearhead of regional policy, development efforts in other sectors assumed an increasing regional dimension, as in agriculture, forestry, fishing and tourism. Some decentralisation of government administration has been introduced. In recent years there has been a growing realisation of the role which service industries could play in regional development.

Transition Initiatives

Smart Infrastructure and Smart Cities are key elements of both the Digital Agenda for Europe and the Irish Government’s plan for economic recovery. In addition to the opportunity around job creation and service revenue, there are also wider benefits to the economy.

According to the report “The Global Technology Hub” published by ICT:

Key recommendations for Government:

“Meet the target of doubling the annual output of honours degree ICT undergraduate programmes by 2018”

Key recommendations for industry:

“Support Skillnets programmes and encourage the up-skilling of existing staff”

Key recommendations for academia:

“Increase number of places available for tech-conversion programmes”


Prospective Plans: Developing a Digital Society

The use of technology throughout society can greatly improve a country’s overall economic performance. Work is on-going to increase the level of Government activity using technology as an enabler in a wide range of areas – from our education system to services for citizens. Notably, Government recently published its Digital Strategy to focus on enhancing the digital and online capabilities of the business community and general public.

Ireland has the goal of becoming the most attractive location in the world for ICT skills availability. The Department of Jobs, Enterprise and Innovation published a report covering the years 2014 – 2018 with an action plan to make Ireland a global leader in ICT talent. One of the main objectives of this plan is to increase the output of high-level graduates and to enhance ICT capacity and awareness in the education system.

This Action Plan is a collaborative effort by Government, the education system and industry to meet the goal of making Ireland the most attractive location in the world for ICT Skills availability. There are a number of challenges faced by the technology industry under the umbrella of education and skills. Ireland is addressing each of these challenges and the below examples demonstrate the improvements to date:

  1. Improving the standard of education in Ireland and increasing the uptake of science, technology, engineering and mathematics (STEM) subjects at all levels in the education system.
  2. Increasing the output of honours level graduates from college level ICT courses.
  3. Maintaining the provision of effective technology conversion courses for those from other disciplines and fields.
  4. The up-skilling of current employees in the technology sector through formal continuous professional development.
  5. The availability of language skills and the ability to attract skilled workers from outside Ireland.

Taken together, and carried out collaboratively across Government, State agencies, the education sector and industry, these actions will ensure that the ICT sector in Ireland continues to thrive, with benefits for everyone in our society.


The most densely populated areas are centred on the largest Irish towns: Dublin, Cork and Galway. They are the main commercial, industrial, administrative, educational and cultural places. The education map of Ireland shows that most of the higher education institutions are concentrated in Dublin and the other densely populated places on the map. One of the reasons more people live in these places is their colleges and universities.

data legend

education map


private college data

However, as we can see from the map, Private Higher Education Institutions are situated only in Dublin.


The Irish Government continues to develop areas and sectors that are critical to the on-going recovery and growth of the Irish economy. The National Digital Strategy and enhanced ICT capacity are key priorities outlined in the Action Plan.

From this perspective, private colleges, and DBS in particular, have a potential opportunity to grow in other densely populated counties of Ireland in the upcoming years. High Dublin rents are also a contributing factor in favour of local education.



3 V’s and Beyond – The Missing V’s in Big Data?

Big data represents the newest and most comprehensive version of organizations’ long-term aspiration to establish and improve their data-driven decision-making. Data in itself is not valuable at all. The value is in how organisations will use that data and turn their organisation into an information-centric company that relies on insights derived from data analyses for their decision-making.

Early detection of the Big Data characteristics can give many organizations a cost-effective strategy for avoiding unnecessary deployment of Big Data technologies. Data analytics on some data may not require Big Data techniques and technologies; current and well-established techniques and technologies may be sufficient to handle the data storage and processing. This brings us to the purpose of the characteristics of Big Data: to help identify whether a problem requires a Big Data solution.


According to Gartner, the definition of big data is:

“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

There are differing opinions on the number of characteristics, or “V dimensions”, needed to identify a project as ‘Big Data’. The original three V’s – Volume, Velocity, and Variety – appeared in 2001, when Gartner analyst Doug Laney used them to identify key dimensions of big data.

3-D Data Management

  1. Volume – The sheer volume of the data is enormous. A very large contributor to the ever-expanding digital universe is the Internet of Things, with sensors all over the world, in all kinds of devices, creating data every second: all the emails, Twitter messages, photos, video clips, sensor readings and so on that we produce and share. Currently, data is generated by employees, partners, machines and customers. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure; this data did not exist five years ago. More sources of data, each of larger size, combine to increase the volume of data that has to be analysed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
  2. Velocity – the speed at which the data is created, stored, analysed and visualized. Big data technology now allows us to analyse data while it is being generated, without ever putting it into databases. Initially, companies analysed data using a batch process: one takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With new sources of data such as social and mobile applications, the batch process breaks down: data now streams into the server in real time, in a continuous fashion, and the result is only useful if the delay is very short.
  3. Variety – Nowadays, 90% of the data generated by organisations is unstructured. Data has moved beyond Excel tables and databases, losing its rigid structure and taking on hundreds of formats: pure text, photos, audio, video, web pages, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, and so on. One no longer has control over the input data format, and as new applications are introduced, new data formats come to life.
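The batch-versus-streaming contrast described under Velocity can be sketched in a few lines (illustrative code, not tied to any real streaming framework): the batch job only answers after the whole chunk has arrived, while the streaming consumer keeps a running result that is up to date after every event.

```python
# Batch: compute the average only after the whole chunk has been collected.
def batch_average(chunk):
    return sum(chunk) / len(chunk)

# Streaming: maintain a running average that is valid after every event,
# so the answer is available with near-zero delay.
class RunningAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

events = [10, 20, 30, 40]
print(batch_average(events))       # one answer, only after the delay

stream = RunningAverage()
for value in events:
    latest = stream.update(value)  # a fresh answer after every event
print(latest)
```

Both loops end at the same figure here; the difference is that the streaming version had a usable answer all along, which is exactly what real-time sources such as social and mobile applications demand.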

The three V’s are the driving dimensions of Big Data, but they are open-ended. There is no specific volume, velocity, or variety of data that constitutes big. These may be the most common but by no means the only descriptors that have been used.



Quantifying ‘Big’ – How Many “V’s” in Big Data?

There are many different characteristics of Big Data on which data scientists agree, but none of which by itself can be used to say that this example is Big Data and that one is not. In fact, I was able to find another eleven different characteristics claimed for Big Data. These characteristics were compiled from several sources, including IBM, Paxata, Datafloq, SAS, Data Science Central and the National Institute of Standards and Technology (NIST).

4. Value – the all-important V, characterizing the business value, ROI, and potential of big data to transform your organization from top to bottom. It is all well and good having access to big data, but unless we can turn it into value it is useless. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of costs and benefits.

5. Viability – Neil Biehn, writing in Wired, sees Viability and Value as distinct missing Vs numbers 4 and 5. According to Biehn, “we want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses; the secret is uncovering the latent, hidden relationships among these variables.”

6. Veracity – the accuracy and reliability of the data. Veracity has an impact on confidence in the data.

7. Variability – the meaning of the data is changing: (rapidly) dynamic, evolving, spatiotemporal data, time series, seasonal data, and any other type of non-static behaviour in your data sources, customers, objects of study, etc.

8. Visualization – making that vast amount of data comprehensible in a manner that is easy to understand and read.

9. Validity – data quality, governance, master data management (MDM) on massive, diverse, distributed, heterogeneous, “unclean” data collections.

10. Venue – distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud.

11. Vocabulary – schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance.

12. Vagueness – confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.) (Note: Venkat Krishnamurthy, Director of Product Management at YarcData, introduced this new “V” at the Big Data Innovation Summit in Santa Clara on June 9, 2014.)

13. Virality – defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events.

14. Volatility – how long the data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.

How many V’s are enough?

In recent years, revisionists have inflated the count well beyond three, expanding the market space but also creating confusion. The additional dimensions all matter, particularly as we consider designing and implementing processes to prepare raw data into “ready to use” information streams. Reaching a common definition of Big Data is one of the first tasks to tackle.

Bill Vorhies, President & Chief Data Scientist at Data-Magnum, has been working with the US Department of Commerce National Institute of Standards and Technology (NIST) working group developing a standardized “Big Data Roadmap” since the summer of 2013. The group elected to stick with Volume, Variety, and Velocity, and kicked the other dimensions out of the Big Data definition as being broadly applicable to all types of data.

Author and analytics strategy consultant Seth Grimes makes a similar observation in his InformationWeek piece “Big Data: Avoid ‘Wanna V’ Confusion”. In the article he differentiates the essence of Big Data, as defined by Doug Laney’s original-and-still-valid 3 Vs, from derived qualities proposed by various vendors. In his opinion, the wanna-V backers and the contrarians mistake interpretive, derived qualities for essential attributes, and conflating inherent aspects with important objectives leads to poor prioritization and planning.

So, the above-mentioned consultants believe that Variability, Veracity, Validity, Value and the rest are not intrinsic, definitional Big Data properties. They are not absolutes. Rather, they reflect the uses you intend for your data and relate to your particular business needs. You discover context-dependent Variability, Veracity, Validity, and Value in your data via analyses that assess and reduce data and present insights in forms that facilitate business decision-making. This function, Big Data Analytics, is the key to understanding Big Data.


I’ve explored many sources to bring you a complete listing of possible definitions of Big Data with the goal of being able to determine what a Big Data opportunity is and what’s not. Once you have a single view of your data, you can start to make intelligent decisions about the business, its performance and the future plans.

In conclusion, Volume, Variety, and Velocity still make the best definitions but none of these stand on their own in identifying Big Data from not-so-big-data.  Understanding these characteristics will help you analyse whether an opportunity calls for a Big Data solution but the key is to understand that this is really about breakthrough changes in the technology of storing, retrieving, and analysing data and then finding the opportunities that can best take advantage.



Bernard Marr, Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance

Andrew McAfee and Erik Brynjolfsson, “Big Data: The Management Revolution”, Harvard Business Review, October 2012


Decisions are only as good as the information on which they are based. The potential damage to service users arising from poor data quality as well as the legal, financial and reputational costs to the organisation are of such magnitude that organisations must be willing to take the time and give the necessary commitment to improve data quality. Every organization today depends on data to understand its customers and employees, design new products, reach target markets, and plan for the future. Accurate, complete, and up-to-date information is essential if you want to optimize your decision making, avoid constantly playing catch-up and maintain your competitive advantage.

Business leaders recognize the value of big data and are eager to analyse it to obtain actionable insights and improve the business outcomes. Unfortunately, the proliferation of data sources and exponential growth in data volumes can make it difficult to maintain high-quality data. To fully realize the benefits of big data, organizations need to lay a strong foundation for managing data quality with best-of-breed data quality tools and practices that can scale and be leveraged across the enterprise.

What can your organisation do to make data quality a success?

Within an organization, acceptable data quality is crucial to operational and transactional processes and to the reliability of business analytics (BA) / business intelligence (BI) reporting.

Confidence in the quality of the information they produce is a survival issue for government agencies around the world. The Health Information and Quality Authority of Ireland has adopted a business-driven approach to standards for data and information and endorses the “Seven essentials for improving data quality” guide:

data quality



Data Quality is central to an effective performance management system throughout the organization. Data quality is a complex measure of data properties from various dimensions and determined by whether or not the data is suitable for its intended use. This is generally referred to as being “fit-for-purpose”. Data is of sufficient quality if it fulfils its intended use (or re-use) in operations, decision making or planning. Maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating it, standardizing it, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.
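The periodic scrubbing described above, standardising values and de-duplicating records into a single view, can be sketched as follows. The record layout, field names and match key here are hypothetical; real matching logic is usually far more sophisticated.

```python
# De-duplicate customer records from two hypothetical source systems by
# standardising the values first, then keeping one record per match key.
records = [
    {"name": " Mary O'Brien ", "email": "MARY@EXAMPLE.COM", "source": "crm"},
    {"name": "Mary O'Brien",   "email": "mary@example.com", "source": "billing"},
    {"name": "Sean Walsh",     "email": "sean@example.com", "source": "crm"},
]

def standardise(record):
    return {
        "name": " ".join(record["name"].split()),   # collapse stray whitespace
        "email": record["email"].strip().lower(),   # canonical email form
        "source": record["source"],
    }

single_view = {}
for record in map(standardise, records):
    # the standardised email acts as the match key; the first record wins
    single_view.setdefault(record["email"], record)

print(len(single_view))  # → 2 unique customers from 3 raw records
```

The same customer held in two disparate systems collapses into one record, which is the “single view of the data” the paragraph above refers to.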

Data Quality Management entails the establishment and deployment of roles, responsibilities, policies, and procedures concerning the acquisition, maintenance, dissemination, and disposition of data. A partnership between the business and technology groups is essential for any data quality management effort to succeed. The business areas are responsible for establishing the business rules that govern the data and are ultimately responsible for verifying the data quality. The Information Technology (IT) group is responsible for establishing and managing the overall environment – architecture, technical facilities, systems, and databases – that acquire, maintain, disseminate, and dispose of the electronic data assets of the organization.

A data quality assurance program

Data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data: an explicit combination of organization, methodologies, and activities that exist for the purpose of reaching and maintaining high levels of data quality. To make the most of open and shared data, public and government users need to define what data quality means with reference to their specific aim or objectives. They must understand the characteristics of the data and consider how well it meets their own needs or expectations. For each dimension of quality, consider what processes must be in place to manage it and how performance can be assessed.

Data quality control

Data quality control is the process of controlling the usage of data with known quality measurements for an application or a process. It is usually carried out after a data quality assurance (QA) process, which consists of discovering and correcting data inconsistencies. Data quality is affected by the way data is entered, stored and managed. Analytics can be worthless, counterproductive and even harmful when based on data that isn’t high quality. Without high-quality data, it doesn’t matter how fast or sophisticated the analytics capability is: you simply won’t be able to turn all that data managed by IT into effective business execution.

Difference between Data and Information

Data and information are interrelated. In fact, they are often mistakenly used interchangeably.

Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.

When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
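A minimal illustration of the distinction: the raw figures below are just data, and only once they are grouped and given context do they become information a decision maker can use. The sales figures and city names are invented for the example.

```python
# Raw data: unorganised facts, seemingly random and not yet useful.
raw = [("Dublin", 120), ("Cork", 80), ("Dublin", 95), ("Cork", 60)]

# Information: the same facts processed, organised and given context.
totals = {}
for city, amount in raw:
    totals[city] = totals.get(city, 0) + amount

print(totals)  # → {'Dublin': 215, 'Cork': 140}
```

Nothing was added to the data; structure and context alone turned it into information.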


If the information we derive from the data is not accurate, we cannot make reliable judgments or develop reliable knowledge from the information. And that knowledge simply cannot become wisdom, since cracks will appear as soon as it is tested.

Bad data costs time and effort, gives false impressions, results in poor forecasts and devalues everything else in the continuum.

What are the factors determining data quality?

Understanding the difference between data and information is the key to solving data quality. To be most effective, the right data needs to be available to decision makers in an accessible format at the point of decision making. The quality of data can be determined through assessment against the following internationally accepted dimensions.

In 1987, David Garvin of the Harvard Business School developed a system of thinking about the quality of products. He proposed eight critical dimensions or categories of quality that can serve as a framework for strategic analysis: performance, features, reliability, conformance, durability, serviceability, aesthetics, and perceived quality.

Agencies create or collect data and information to meet their operational and regulatory requirements. They will define their own acceptable levels of data quality according to these primary purposes. It is often a mistake to stick with old quality measures when the external environment has changed.

Thus dimensions of quality also differ from user to user: completeness, legibility, relevance, reliability, accuracy, timeliness, accessibility, interpretability, coherence and validity. Data also has to be manageable in volume and cost-effective. Clearly these dimensions are not independent of each other. Assessing them will help ensure that an organisation has a good level of data quality supporting the information it produces.
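Several of these dimensions can be scored directly against a data set. A sketch, using hypothetical records and rules, that measures completeness and validity as simple ratios:

```python
# Score two data quality dimensions over a hypothetical record set:
# completeness = share of non-missing values in a field,
# validity     = share of values passing a business rule (here: age 0-120).
records = [
    {"name": "Aoife", "age": 34},
    {"name": None,    "age": 29},    # missing name hurts completeness
    {"name": "Liam",  "age": 200},   # impossible age fails the validity rule
]

def completeness(records, field):
    present = sum(1 for r in records if r[field] is not None)
    return present / len(records)

def validity(records, field, rule):
    valid = sum(1 for r in records if r[field] is not None and rule(r[field]))
    return valid / len(records)

print(round(completeness(records, "name"), 2))                      # → 0.67
print(round(validity(records, "age", lambda a: 0 <= a <= 120), 2))  # → 0.67
```

Timeliness, coherence and the other dimensions need similar, purpose-specific rules: which is exactly why data quality is defined as fitness for the intended use rather than as a single universal score.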

The dimensions contributing to data quality

data q

Master Data Management

Many business problems trace back, in the end, to a lack of data governance and poor-quality data. Master data management technology can address a lot of these issues, but only when driven by an MDM strategy that includes a vision that supports the overall business and incorporates a metrics-based business case. Data governance and organizational issues must be put front and centre, and new processes designed to manage data through the entire information management life cycle. Only then can you successfully implement the new technology you’ll introduce in a data quality or master data management initiative.

At its recent Master Data Management Summit in Europe, Gartner recommended a structured approach to implementing master data management: begin with a strategy for development and planning, then set up a process to govern data. This will subsequently aid change management of all types and help target data smartly at strategic business goals. Once set, data management can be measured, monitored and altered to stay on course.

MDM software includes process, governance, policy, standards and tools to manage an organization’s critical data. MDM applications manage customer, supplier, product, and financial data with data governance services and supporting world-class integration and BI components. Data quality is a first step towards MDM, which allows you to start with one application knowing that MDM will be introduced as more applications get into the act.


Data Governance


At its core, data governance incorporates three key areas: people, process and technology. In other words, a data governance framework assigns ownership and responsibility for data, defines the processes for managing data, and leverages technologies that will help enable the aforementioned people and processes. At the end of the day, data quality and data governance are not synonymous, but they are closely related. Quality needs to be a mandatory piece of a larger governance strategy. Without it, your organization is not going to successfully manage and govern its most strategic asset: its data.

Any good active data governance methodology should let you measure your data quality. This is important because data quality actually has multiple dimensions which need to be managed. Data governance initiatives improve data quality by assigning a team responsible for data’s accuracy, accessibility, consistency, and completeness, among other metrics. This team usually consists of executive leadership, project management, line-of-business managers, and data stewards. The team usually employs some form of methodology for tracking and improving enterprise data, such as Six Sigma, and tools for data mapping, profiling, cleansing, and monitoring data.

International standard

ISO 8000 is the international standard that defines the requirements for quality data. Understanding this important standard, and how it can be used to measure data quality, is an important first step in developing any information quality strategy.



  1. NSW Government, Standard for Data Quality Reporting, March 2015
  2. Health Information and Quality Authority of Ireland, 2012, “What you should know about data quality”