The Data Scientist: Finding the right tools for the job



Which technologies and tools look best on a resume? Do you need to know HBase, Cassandra, MySQL, Excel, SPSS, R or SAS? Professionals advise: all of the above.

Data analysts have many tools at their disposal, from linear regression to classification trees to random forests, and these tools have all been carefully implemented on computers. But ultimately, it takes a data analyst—a person—to find a way to assemble all of the tools and apply them to data to answer a question of interest to people.

While working as the chief data scientist at Facebook, Jeff Hammerbacher described how, on any given day, a team of scientists would use Python, R and Hadoop, and then have to relay the analyses to colleagues. Additionally, a recent SiSense study of data professionals found that 60 percent of respondents use three or more data warehouse and business intelligence interfaces.

Data volumes are growing rapidly, and so is the number of tools for working with that data. We can categorize these tools by the tasks and data they handle. Based on the type of request, the tools available on the market fall into three groups: reports and dashboard development tools, statistical packages, and BI tools.

Tools for Data Analysis – Reports and Dashboard development Tools

Report generation and dashboard development are daily tasks for any organization that wants to understand its data on a timely basis. Organizations generate daily, weekly, monthly, yearly and ad-hoc reports and dashboards. Many tools are available on the market, such as MS Excel, Tableau, QlikView and Spotfire. These tools are not only for developing reports and dashboards; vendors are also adding capabilities for statistical analysis. Excel is a spreadsheet application, while Tableau, QlikView and Spotfire can be considered BI tools with powerful engines for performing ETL.

Tools for Data Analysis – Statistical Packages

SAS and SPSS are the leaders in advanced statistical modelling. There are also many open source tools, like R, for statistical analysis. These packages provide many built-in procedures and tools for working with your data: you can extract, clean and format the data, tabulate it and run statistical analyses. SAS and SPSS can also be considered BI tools, as they have very powerful modules to perform ETL. ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
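To make the three ETL stages concrete, here is a minimal sketch in plain Python. It is not how SAS or SPSS implement ETL; the sample data, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from a source system.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,alice,not_a_number
"""


def extract(text):
    """Extract: read raw rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise customer names, parse amounts, skip bad rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        customer = row["customer"].strip().lower()
        clean.append((int(row["order_id"]), customer, amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory SQLite is an assumption standing in for the warehouse.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
etl_result = conn.execute(
    "SELECT customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(etl_result)
```

Keeping each stage as its own function mirrors the way ETL tools separate the steps, so the cleaning rules can change without touching the code that reads or writes data.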

Tools for Data Analysis – BI Tools

MS SQL Server, Oracle, SAP, MicroStrategy, Spotfire, QlikView, SAS and SPSS Modeler are the leaders in this market. They can handle large amounts of data and perform ETL and drill-down analysis. We will cover BI concepts in more detail in a separate topic.

Thankfully, there are ample resources on the Web to develop and hone your skills. Big Data University, for example, offers free resources to help data professionals gain proficiency in JAQL, MapReduce, Hive, Pig and others.

It’s also important to gain experience using these skills in the “real world.” Gopinathan advises aspiring data scientists to participate heavily in open-source projects and data contests, such as Kaggle, to practice utilizing technical, scientific and visual skills in real business scenarios.

In summary, making the transition from BI specialist to data scientist will require the following new skills and capabilities:

  • Deep dive into the multitude of statistical and predictive analytics models. Without a doubt, you’re going to have to get out your college statistics and advanced statistics books and spend time learning how and when to apply the right analytic models given the business situation.
  • Learning new analytic tools like R, SAS and MADlib. R, for example, is an open source product for which many tools (like RStudio) and a great deal of training are available free online.
  • Learning more about Hadoop and related Hadoop products like HBase, Hive and Pig. There is no doubt that Hadoop is here to stay, and there will be a multitude of opportunities to use Hadoop in the data preparation stage.  It’s the perfect environment for adding structure to unstructured data, performing advanced data transformations and enrichments, and profiling and cleansing your data coming from a multitude of data sources.
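The idea of adding structure to unstructured data in the data preparation stage can be sketched with the map and reduce pattern that Hadoop popularized. The example below is plain Python, not the Hadoop API, and the log lines and function names are hypothetical; it only illustrates the shape of the computation.

```python
from collections import defaultdict

# Hypothetical unstructured input; real jobs would read files from a cluster.
RAW_LINES = [
    "2014-03-01 ERROR disk full",
    "2014-03-01 INFO job started",
    "garbage line",
    "2014-03-02 ERROR timeout",
]


def map_line(line):
    """Map: parse one raw line into (key, value) pairs, skipping bad input."""
    parts = line.split(maxsplit=2)
    if len(parts) < 3:
        return []  # cleansing step: malformed lines produce no output
    _date, level, _message = parts
    return [(level, 1)]


def reduce_counts(pairs):
    """Reduce: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = [pair for line in RAW_LINES for pair in map_line(line)]
level_counts = reduce_counts(pairs)
print(level_counts)
```

On a real cluster the map calls would run in parallel across many machines and the framework would group the pairs by key before the reduce step; here both steps run in a single process.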

As you make your way through the world of data science, learning R programming and other important skills, it’s important to remember that data science isn’t just a collection of tools.

It requires a person to apply those tools in a smart way to produce results that are useful to people. Choosing the right modelling approach is often a creative exercise that demands expert human judgment.

No matter what type of company you’re interviewing with, you will likely be expected to know how to use the tools of the trade. This means a statistical programming language, like R or Python, and a database querying language like SQL.
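As a small illustration of how the two fit together, the sketch below uses SQL to aggregate data and Python to summarize the result. The `visits` table and its values are invented for the example, and an in-memory SQLite database is assumed in place of a production database.

```python
import sqlite3
import statistics

# Hypothetical data set: minutes spent per visit, by visitor.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (visitor TEXT, minutes REAL)")
db.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("ann", 10.0), ("ann", 20.0), ("raj", 5.0), ("raj", 7.0), ("raj", 9.0)],
)

# SQL does the grouping and summing inside the database.
rows = db.execute(
    "SELECT visitor, SUM(minutes) FROM visits "
    "GROUP BY visitor ORDER BY visitor"
).fetchall()
totals = dict(rows)

# Python takes over for the statistical summary.
mean_total = statistics.mean(list(totals.values()))
print(totals, mean_total)
```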
