IBM BigInsights Security Implementation: Part 1 Introduction to Security Architecture

IBM Redbooks Analytics Support Web Doc

Published 25 August 2016

Authors: Bharath Devaraju, Shankar Kuchibhotla, Nisanth Simon

Abstract

Big data analytics involves processing large amounts of data that cannot be handled by conventional systems. The IBM® BigInsights® platform processes large amounts of data by breaking the computation into smaller tasks that can be distributed onto several nodes. As this platform is shared by users in different roles (developers, analysts, data scientists, and testers), it introduces the challenge of provisioning access and authorization to the cluster and securing the data.

Big data platforms are an amalgamation of several individual components that are still evolving and that are shaped by the challenges and requirements of the open source community. These components are developed in isolation by independent teams, without forethought about integrating them securely, so each component defines and exposes its own security policies for data and access protection. This inherent lack of a single point of security policy enforcement in big data platforms can be challenging and overwhelming.

This IBM Redbooks® Analytics Support web doc introduces a reference security architecture for the IBM BigInsights solution that is in line with current industry practices. It can be used as a reference document for solution architects and solution implementers. This document applies to IBM BigInsights Version 4.2 and later.

Contents


Security aspects to consider when designing the security architecture for IBM BigInsights

Securing an IBM BigInsights cluster involves addressing four main security aspects, which are shown in Figure 1:

  • Secure Perimeter
  • Secure Data
  • Access Management
  • Audit

Figure 1. Four aspects of IBM BigInsights security

A secure perimeter can be enforced in the following ways:
  • By authenticating users against LDAP and Kerberos
  • By protecting HTTPS access through the Apache Knox security gateway
  • By isolating the data nodes in a secure private network

Data can be secured in the following ways:
  • By using Hadoop transparent encryption with the Hadoop Key Management Server (KMS)
  • By using IBM Big SQL data masking
  • By enabling SSL and TLS support for components to secure data transfers

Access management should be enforced at several levels:
  • At the job level, by using Yet Another Resource Negotiator (YARN) queue-based access control
  • At the SQL level, by using SQL access privileges for SQL access to Hadoop data
  • At the file level, by using ACL-based access control for Hadoop Distributed File System (HDFS) files
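
To make the job-level control more concrete, the following Python sketch builds the two queue ACL properties that the YARN Capacity Scheduler reads for a queue. It assumes that the cluster uses the Capacity Scheduler and that yarn.acl.enable is set to true. The queue root.analytics, the users, and the group are hypothetical names.

```python
def hadoop_acl(users=(), groups=()):
    """Format a Hadoop ACL value: comma-separated users, a space, then groups."""
    user_part = ",".join(users)
    group_part = ",".join(groups)
    if not user_part and not group_part:
        return " "  # a single space grants access to nobody
    if not user_part:
        return f" {group_part}"  # groups only: the leading space is required
    return f"{user_part} {group_part}".rstrip()


def queue_acl_properties(queue_path, submit_acl, admin_acl):
    """Return Capacity Scheduler ACL properties for one queue."""
    prefix = f"yarn.scheduler.capacity.{queue_path}"
    return {
        f"{prefix}.acl_submit_applications": submit_acl,
        f"{prefix}.acl_administer_queue": admin_acl,
    }


if __name__ == "__main__":
    # Hypothetical queue, users, and group, for illustration only.
    props = queue_acl_properties(
        "root.analytics",
        submit_acl=hadoop_acl(users=["alice", "bob"], groups=["analysts"]),
        admin_acl=hadoop_acl(users=["yarnadmin"]),
    )
    for name, value in sorted(props.items()):
        print(f"{name}={value!r}")
```

Because a child queue also accepts anyone who is allowed on its parent queues, and the root queue allows all users by default, the root queue's ACLs usually need to be restricted as well for per-queue settings to take effect.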

Auditing and reporting are provided in the following ways:
  • By using lightweight monitoring that uses Java Management Extensions (JMX)
  • By using IBM Security Guardium® Data Activity Monitor

Figure 2 shows a high-level design of a secure IBM BigInsights cluster. It highlights various components that are the building blocks of a big data cluster architecture design.

Figure 2. IBM BigInsights high-level design

Note the following items in Figure 2:
  • An IBM BigInsights cluster can span two networks: a public network and a private network. Communication between the two networks occurs through an edge node (1).
  • This edge node has the client components (2) for all the master services that are deployed in the cluster so that users can connect and perform analytics and administration. Access to administration and analytic tools is enforced through Ambari user management and the Knox gateway (3).
  • Data encryption protects user data from unauthorized access and helps the cluster comply with industry security standards. Data can be encrypted at rest and while it is being transferred over the network. Encryption at rest is performed in two ways: by using Hadoop transparent data encryption and by using IBM Security Guardium Data Encryption.
    • Hadoop transparent data encryption uses the key management server (KMS), which holds the encryption and decryption keys (4).
    • IBM Security Guardium Data Encryption deploys agents on all nodes to encrypt and decrypt data. The IBM Security Guardium server monitors and enforces encryption and decryption policies and rules on the agents (5).
    Data transfers over the network are secured by configuring services to use SSL and TLS certificates.
  • Similar to the Linux file system, Hadoop Distributed File System (HDFS) provides fine-grained user access control through file system access control lists (ACLs) (6). Big SQL and Hive provide GRANT and REVOKE statements to authorize users to perform specific operations (7).
  • IBM Security Guardium Data Activity Monitor provides monitoring and auditing capabilities that you can use to seamlessly integrate Hadoop data protection into your existing enterprise data security strategy. HDFS also has its own auditing mechanism, which captures all file system activities (8).
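
The HDFS ACLs in item (6) can be sketched in Python as follows. The helper functions build ACL entries in the user:name:perms form and the matching hdfs dfs -setfacl command line without running it. The user alice, the group analysts, and the path /data/sales are hypothetical, and the sketch assumes that ACL support is enabled on the NameNode (dfs.namenode.acls.enabled set to true).

```python
import re
import shlex


def acl_entry(kind, name, perms, default=False):
    """Build one HDFS ACL entry such as 'user:alice:r-x'."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    # Permissions are three characters: read, write, execute, or '-' for each.
    if not re.fullmatch(r"[r-][w-][x-]", perms):
        raise ValueError("perms must look like 'r-x'")
    entry = f"{kind}:{name}:{perms}"
    # A default entry on a directory is inherited by newly created children.
    return f"default:{entry}" if default else entry


def setfacl_command(path, entries, recursive=False):
    """Return the 'hdfs dfs -setfacl -m' argument list (not executed here)."""
    cmd = ["hdfs", "dfs", "-setfacl"]
    if recursive:
        cmd.append("-R")
    cmd += ["-m", ",".join(entries), path]
    return cmd


if __name__ == "__main__":
    # Hypothetical user, group, and path, for illustration only.
    entries = [
        acl_entry("user", "alice", "r-x"),
        acl_entry("group", "analysts", "r--"),
    ]
    print(shlex.join(setfacl_command("/data/sales", entries)))
    print(shlex.join(["hdfs", "dfs", "-getfacl", "/data/sales"]))
```

The second printed command, hdfs dfs -getfacl, is the usual way to confirm which entries are in effect on a path after a change.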

Designing the security architecture for IBM BigInsights involves both using the security features that the individual components provide and taking a holistic approach to securing data, users, and functions against possible vulnerabilities.


Acknowledgements

Thanks to Mohan Dani, IBM BigInsights software developer, for his contributions to this project.


Special Notices

This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.
