An Apache Hadoop infrastructure offers great promise for drastically reducing the costs of storing and processing large data volumes and for increasing ROI. However, the Hadoop infrastructure alone cannot deliver on that promise. Success with Hadoop requires effectively managing data movement, data transformation and integration, data cleansing, data governance, data security, data privacy, and data analytics and reporting.
Many organizations are considering implementing a data lake solution. A data lake is a set of one or more data repositories created to support data discovery, analytics, ad hoc investigations, and reporting. Without proper management and governance, a data lake can quickly become a data swamp. IBM® proposes an enhanced data lake solution that is built with management, affordability, and governance at its core. This solution is known as a data reservoir. A data reservoir provides the right information about the sources of data available in the lake and about their usefulness to users who are investigating, reporting on, and analyzing events and relationships in the reservoir.
This IBM Redbooks® Solution Guide introduces IBM InfoSphere® Information Server, which provides an integrated set of tools that are built to handle the extreme throughput and governance required by today’s demanding business enterprises. It addresses the practical realities of managing the data integration tasks that are required for success with Hadoop. Managing these data integration tasks effectively in the Hadoop environment is one critical step in supporting a data reservoir instead of creating a data swamp.
The material included in this document is in DRAFT form and is provided 'as is' without warranty of any kind. IBM is not responsible for the accuracy or completeness of the material, and may update the document at any time. The final, published document may not include any, or all, of the material included herein. Client assumes all risks associated with Client's use of this document.