1. Introduction
Humanity produces massive amounts of sensitive data daily, and organizations need to manage everything from personal information and financial records to classified documents and cybersecurity logs. Traditional databases often struggle with both the volume of Big Data and the complex security requirements of modern enterprises.
Since secure data management has become strategically important and many organizations require fine-grained access control down to the cell level, we need a database system capable of handling massive datasets while maintaining strict security protocols: petabytes of data with billions of individual access decisions.
In this introductory article, we’ll explore Apache Accumulo, a powerful distributed key-value store with unparalleled cell-level security, high performance, and scalability.
2. What Is Apache Accumulo?
Apache Accumulo, originally developed by the National Security Agency (NSA) based on Google’s Bigtable design, is a distributed key-value store.
Built on top of Apache Hadoop and Apache ZooKeeper, it’s designed to handle massive data volumes across clusters of commodity hardware.
Accumulo enables efficient data ingestion, retrieval, and storage. It also provides server-side programming to allow complex data processing directly within the database, making it a sophisticated solution with fine-grained access control to handle sensitive big data.
The key features of Apache Accumulo are the following:
- Scalability: can manage petabytes of data across large clusters
- High Performance: uses in-memory processing and optimizations for efficient data access
- Cell-Level Security: allows fine-grained access control, where each cell can have a unique visibility label
- Rich API for Customization: offers features like iterators for in-database processing
Similar to Google’s Bigtable, which is used in web indexing, Google Earth, and Google Finance, Apache Accumulo is useful in a variety of applications, including but not limited to:
- Government and military data systems
- Healthcare record management
- Financial services data
- Cybersecurity analytics
- Large-scale graph processing
3. Installation and Setup
First, let’s make sure that prerequisites like Java 11, Apache Hadoop, YARN, and Apache ZooKeeper are installed, with the corresponding JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME environment variables set and their bin directories added to the PATH.
Then, we’ll download the latest version of Apache Accumulo and extract it:
$ tar -xzf accumulo-2.1.3-bin.tar.gz
Likewise, we can add ACCUMULO_HOME to the path variable:
$ export ACCUMULO_HOME=/path/to/accumulo
$ export PATH=$ACCUMULO_HOME/bin:$PATH
Next, we start services like ZooKeeper, Hadoop HDFS, and YARN in that order:
$ zkServer start
$ start-dfs.sh
$ start-yarn.sh
Also, we need to make sure that HDFS is reachable at localhost:8020 and ZooKeeper at localhost:2181, since these are the default values referenced in accumulo.properties.
Let’s confirm everything is running using the jps command, which should show output similar to:
82306 Main
81385 DataNode
81745 ResourceManager
82867 Jps
81846 NodeManager
81530 SecondaryNameNode
81276 NameNode
Now, we’re ready to set up Accumulo to store data in ZooKeeper and HDFS:
$ accumulo init
The init command is required only once and prompts for instance name and root password.
Then, we’ll create additional configuration files required to start the cluster:
$ accumulo-cluster create-config
Finally, we’re ready to start the cluster:
$ accumulo-cluster start
Once started, we can run the Accumulo shell – a command-line tool for interacting with Apache Accumulo:
$ accumulo shell -u root
Note: This command uses the connection details, such as the instance name and password, configured in accumulo-client.properties.
Accumulo Shell provides basic commands to manage, query, and perform administrative tasks on tables and instances.
Let’s take a look at a few commands that are most handy:
- tables: lists all tables in the instance
- createtable <table>: creates a new table
- deletetable <table>: deletes a table
- scan: scans and displays data from the current table
- insert <row> <colfam> <colqual> <value>: inserts a value into the table
- delete <row> <colfam> <colqual>: deletes a specific entry from the table
- setiter -t <table>: sets a table-specific iterator
- listiter [-scan | -table]: lists the iterators for a scanner or a table
- createuser <username>: creates a new user
- info: displays system information about the Accumulo instance
- config: views or changes configuration settings
- flush <table>: forces a flush of memory to disk for a table
- compact <table>: compacts the table’s data
4. Data Model
The Accumulo data model is similar to Google’s Bigtable, providing a sparse, distributed, persistent multi-dimensional sorted map.
Specifically, each key in Accumulo consists of three components (making it unique for every value stored):
- Row ID: The primary identifier for a row of data, used for lexicographical sorting of the data
- Column:
  - Family: Columns are grouped into families, which act as categories or namespaces for the data. Column families provide a way to organize related data.
  - Qualifier: Within each column family, individual columns are identified by a column qualifier. This allows for fine-grained differentiation of data within a column family.
  - Visibility: Each key-value pair can be associated with a security label or visibility. This allows for cell-level access control, where users must have the appropriate authorizations to read the data.
- Timestamp: A version number associated with each key-value pair, allowing Accumulo to store multiple versions of the same data
Overall, the Accumulo data model provides a flexible and secure framework for managing large-scale, structured datasets with intricate security needs.
Its use of row IDs, column families, and qualifiers enables robust data organization and querying, while cell-level visibility controls ensure the protection of sensitive information.
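To make the key structure concrete, here’s a minimal sketch that assembles and inspects a key using the Key class from accumulo-core; the row, family, qualifier, visibility, and timestamp values are purely illustrative:
import org.apache.accumulo.core.data.Key;
import org.apache.hadoop.io.Text;

public class KeyStructureExample {
    public static void main(String[] args) {
        // assemble a key from its parts: row ID, column (family, qualifier, visibility), and timestamp
        Key key = new Key(new Text("patient_001"), new Text("vitals"),
          new Text("heart_rate"), new Text("medical_staff"), 1700000000000L);

        System.out.println("Row ID:     " + key.getRow());
        System.out.println("Family:     " + key.getColumnFamily());
        System.out.println("Qualifier:  " + key.getColumnQualifier());
        System.out.println("Visibility: " + key.getColumnVisibility());
        System.out.println("Timestamp:  " + key.getTimestamp());
    }
}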
5. Operations and Features
5.1. Basic Table Operations
Accumulo offers robust capabilities for managing tables. We can create new tables as needed, clone existing tables for testing or development purposes, and split large tables into smaller tablets for performance optimization.
Additionally, tables can be merged to consolidate data and improve query efficiency. Accumulo also supports flexible data import and export operations, enabling seamless data migration and integration with other systems.
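As a rough sketch of how these operations look through the Java client API (which we’ll set up in section 6), assuming an AccumuloClient named client and a hypothetical events table:
TableOperations ops = client.tableOperations();

// create the table if it doesn't exist yet
if (!ops.exists("events")) {
    ops.create("events");
}

// clone it for testing without touching the original data
ops.clone("events", "events_dev", true, Collections.emptyMap(), Collections.emptySet());

// pre-split the table into tablets at the given row boundaries
SortedSet<Text> splits = new TreeSet<>(List.of(new Text("g"), new Text("n"), new Text("t")));
ops.addSplits("events", splits);

// merge all tablets back together (null boundaries cover the whole table)
ops.merge("events", null, null);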
5.2. Data Handling
Accumulo provides fundamental data manipulation to create, update, and delete data. For efficient handling of large datasets, Accumulo offers batch operations, allowing for the bulk processing of data.
Furthermore, range-based scans enable efficient retrieval of specific data subsets, optimizing query performance.
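For instance, a BatchScanner can fetch several ranges in parallel; the sketch below assumes the client from section 6 and a hypothetical orders table:
// scan a contiguous block of rows and one exact row in a single batch
List<Range> ranges = List.of(
  new Range("order_1000", "order_1999"),
  Range.exact("order_42"));

try (BatchScanner scanner = client.createBatchScanner("orders", new Authorizations("public"), 4)) {
    scanner.setRanges(ranges);
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey().getRow() + " -> " + entry.getValue());
    }
}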
5.3. Security Features
Accumulo provides cell-level security by setting security labels for every piece of data. We can set up complex security rules using boolean expressions and manage user access to enforce fine-grained authorization policies.
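As an illustration, the sketch below grants a hypothetical analyst user two authorizations and writes a cell to a hypothetical accounts table whose visibility expression requires both labels; it assumes the AccumuloClient from section 6:
// grant an existing user the authorizations it may present at scan time
client.securityOperations().changeUserAuthorizations("analyst",
  new Authorizations("finance", "internal"));

// write a cell that only users holding BOTH labels can read
try (BatchWriter writer = client.createBatchWriter("accounts", new BatchWriterConfig())) {
    Mutation mutation = new Mutation("account_7");
    mutation.at()
      .family("balance")
      .qualifier("current")
      .visibility("finance&internal")
      .put("1250.00");
    writer.addMutation(mutation);
}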
5.4. Iterator Framework
Accumulo provides powerful Iterators that act as on-the-spot data processors, working directly where the data resides. They handle filtering, aggregating, and transforming data on the server itself, so we don’t need to send large amounts of raw data over the network.
This results in faster query processing, greater efficiency, and reduced network traffic.
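For example, we can attach one of the built-in iterators, such as RegExFilter, to a single scan so that filtering happens on the tablet servers; the table name and pattern below are hypothetical:
// configure a server-side filter that only passes rows matching the pattern
IteratorSetting filter = new IteratorSetting(20, "loginEvents", RegExFilter.class);
RegExFilter.setRegexs(filter, "login_.*", null, null, null, false);

try (Scanner scanner = client.createScanner("events", Authorizations.EMPTY)) {
    scanner.addScanIterator(filter);   // applies only to this scan
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}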
5.5. Performance Optimizations
Accumulo incorporates various performance optimizations like write-ahead logging, memory-based writing, Bloom filters, and Locality groups to ensure efficient data storage and retrieval.
Write-ahead logging guarantees data durability, while memory-based writing accelerates data ingestion. Bloom filters enable fast lookups, reducing the need for full table scans. Locality groups optimize data placement, improving read and write performance.
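As a sketch, the last two can be configured through the Java API on a hypothetical documents table: we group frequently co-read column families into a locality group and enable Bloom filters via a table property:
// place the column families that are usually read together in one locality group
Map<String, Set<Text>> groups = Map.of(
  "metadata", Set.of(new Text("author"), new Text("created")));
client.tableOperations().setLocalityGroups("documents", groups);

// enable Bloom filters so lookups can skip files that cannot contain the key
client.tableOperations().setProperty("documents", "table.bloom.enabled", "true");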
5.6. Scaling and Distribution
Accumulo automatically splits tablets and balances the load when more data is added, and integrating new machines into the cluster is as simple as pointing them to it. The system manages data distribution smoothly as the data expands.
5.7. Real-Time Insights
Accumulo provides real-time insights by allowing us to monitor performance metrics, track resource usage, and detect issues as they arise.
With its efficient data processing capabilities and integration with monitoring tools, we can quickly respond to changes and ensure optimal system performance.
5.8. Administration
Accumulo offers robust administrative capabilities, including reliable backup and recovery mechanisms, intelligent data compaction strategies, and flexible system configuration options.
It also provides benefits like comprehensive user management and resource control, ensuring secure access and optimal performance.
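For example, flushing and compacting a table programmatically looks roughly like this, again assuming the client from section 6 and a hypothetical events table:
// flush the table's in-memory entries to disk and wait for completion
client.tableOperations().flush("events", null, null, true);

// compact the whole table, merging its files and purging deleted entries
client.tableOperations().compact("events", null, null, true, true);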
6. Accumulo Clients
Now that we’ve covered Accumulo’s installation process, data model, operations, and features, let’s explore Accumulo clients to interact with Accumulo through its Java API.
The Accumulo Client API allows us to perform administrative tasks, query data, and manage tables programmatically.
6.1. Maven Dependency
First, let’s add the latest accumulo-core Maven dependency to our pom.xml:
<dependency>
    <groupId>org.apache.accumulo</groupId>
    <artifactId>accumulo-core</artifactId>
    <version>2.1.3</version>
</dependency>
This dependency adds the necessary classes and methods to work with Accumulo.
6.2. Create the Accumulo Client
Next, let’s create a client to interact with Accumulo:
AccumuloClient client = Accumulo.newClient()
.to("accumuloInstanceName", "localhost:2181")
.as("username", "password").build();
We’ve used the builder to initialize the connection, specifying the instance name, the ZooKeeper connection string, and the username and password of the Accumulo user.
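Since AccumuloClient holds connections to ZooKeeper and the tablet servers, it implements AutoCloseable, so it’s a good idea to close it once we’re done, for example with try-with-resources:
try (AccumuloClient client = Accumulo.newClient()
  .to("accumuloInstanceName", "localhost:2181")
  .as("username", "password").build()) {
    // perform table operations, reads, and writes here
}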
6.3. Basic Operations
Next, with the client set up, let’s perform the basic operation of creating a table:
client.tableOperations().create(tableName);
Then, to add data to the table, we can use the BatchWriter class that offers high-performance, batch-oriented writes:
try (BatchWriter writer = client.createBatchWriter(tableName, new BatchWriterConfig())) {
    Mutation mutation1 = new Mutation("row1");
    mutation1.at()
      .family("column family 1")
      .qualifier("column family 1 qualifier 1")
      .visibility("public")
      .put("value 1");

    Mutation mutation2 = new Mutation("row2");
    mutation2.at()
      .family("column family 1")
      .qualifier("column family 1 qualifier 2")
      .visibility("private")
      .put("value 2");

    writer.addMutation(mutation1);
    writer.addMutation(mutation2);
}
Here, each entry is represented by a Mutation object that carries the column information, such as the family, qualifier, and visibility, discussed previously in the data model.
Similarly, let’s retrieve data from the table using the Scanner class:
try (var scanner = client.createScanner(tableName, new Authorizations("public"))) {
    scanner.setRange(new Range());
    for (Map.Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
Here, we iterate over the entries within a specified range; an empty Range scans the entire table, and passing the public authorization ensures that only publicly visible data is fetched.
7. Conclusion
In this tutorial, we’ve discussed Apache Accumulo, a versatile, scalable database that excels in handling massive datasets with complex access requirements.
Its unique features, such as cell-level security, iterators, and flexible data models, make it an excellent choice for applications requiring secure and efficient data management for real-time analytics, secure data processing, or large-scale data storage.
First, we explored the steps for installation and setup. Then, we looked at its unique data model. Finally, we familiarized ourselves with the available operations and features.
The complete code for this article is available over on GitHub.