Data Management: February 2014

It is time to start the tool-based posts, beginning with Datastage, and to share knowledge.

Before going deeper into Datastage, it is worth defining, for those new to the topic, what ETL is and how this technology is currently understood and positioned in the market.

ETL ("extract, transform, load") refers to the process, or the supporting tools, used to pull data out of a source (database, file, etc.), make the necessary transformations and load the result into another database as a target. As I stated in my previous post "Why data integration geting more important?", there are main requirements that make us think about data integration. Mainly:

(-) Data is no longer a stand-alone asset for enterprises or organizations.

(+) Data is a commodity that moves around the enterprise, going into and coming out of other processes, other systems, other enterprises, etc.

To handle this commodity throughout your systems, your enterprise and even your whole environment, you need to consider your data integration capability instead of just worrying about full compatibility between systems. Compatibility and the bundling of complementary tools were used as a marketing and sales strategy in the development era of information technology, or mainly of computer science. However, with the explosion of knowledge and technological developments, it is almost impossible for a single vendor to respond to and meet all expectations in the market. You have probably already noticed that vendors stick more strictly to standards, and that focused expertise combined with a partnership approach is becoming more popular.

With the discussions on Big Data and NoSQL systems, there are two main views on whether ETL will still be in use. You can find the viewpoints of both sides in an InformationWeek article: Phil Shelley, former CTO of Sears Holdings and CEO of Metascale, who also established his Big Data consulting firm NPP-Newton Park Partners last year (2013), and James Markarian, CTO of Informatica.

It is impossible to ignore Shelley's idea that "since Hadoop came to the enterprise, we are beginning to see the end of ETL as we know it". But there are points on which I do not completely agree with Shelley. When he says ETL, he focuses on the technology as it is now, but we know that technology tends to evolve according to the requirements of the market, and the result does not necessarily have to be called a new technology. Additionally, I do not see every stage of ETL as a non-value-added activity. Within the context of relational databases and structured data, with a good design and good performance, ETL lets you add value to your data and turn your stand-alone asset into a commodity that can be used for different purposes throughout your organisation. Shelley might be completely right for Hadoop, but I really doubt whether "Hadoop came to the enterprise"! And even if it has, is it possible to have Hadoop as the only system? He himself also states: "Some subsets of data do have to be moved out of Hadoop into other systems, for specific purposes. However, with a strong and coherent enterprise data architecture, this can be managed to be the exception."

I do not want you to get lost in different articles before starting to learn a tool, but I also strongly believe that it is better to understand the requirement and motivation behind any effort. It might also be good to have a look at the article "The State of ETL: Extract, Transform and Load Technology", written by Alan R. Earls in DataInformed.

Not just for ETL but for all technologies, it is better to treat what you learn as conceptual knowledge and use it to understand new technologies. ETL will most likely not exist for many more years in its traditional form, but the logic behind it will remain: extract data (filter and read, not necessarily load to a different system), transform it if necessary, and load it (to a new system, to a modeling tool or just to a user interface) so that the data is available to serve specific purposes.
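The extract, transform and load logic described above can be sketched in a few lines of Python. This is only an illustration and not how Datastage works: the sample data, the `sales` table and the function names are assumptions made for the example, and the standard-library `csv` and `sqlite3` modules stand in for a real source and target.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this could be a file or a database table.
SOURCE_CSV = """customer,country,amount
alice,tr,120.50
bob,us,80.00
carol,tr,not_available
"""


def extract(text, country):
    """Read rows from the source and keep only one country (filter and read)."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["country"] == country]


def transform(rows):
    """Normalize names and country codes and convert amounts to numbers."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        cleaned.append((row["customer"].title(), row["country"].upper(), amount))
    return cleaned


def load(rows, connection):
    """Write the transformed rows into the (assumed) target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    connection.commit()


def run_etl(connection, country="tr"):
    """Chain the three steps and return what ended up in the target."""
    load(transform(extract(SOURCE_CSV, country)), connection)
    return connection.execute(
        "SELECT customer, country, amount FROM sales"
    ).fetchall()


if __name__ == "__main__":
    target = sqlite3.connect(":memory:")
    print(run_etl(target))
```

Keeping the three steps as separate functions mirrors the point made above: the target of the load step could just as well be a modeling tool or a user interface, and only `load` would need to change.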

I hope you will find the Datastage posts helpful for your ongoing tasks and for building a vision that prepares you for new technologies.

http://www-01.ibm.com/software/data/infosphere/

Monday, February 24, 2014

ETL Technology in market