Data Disposal - Open Source Java-based Big Data Retention Tool
<p>By <a href="https://www.linkedin.com/in/samuel-groth-6691bb25/">Sam Groth</a>, Senior Software Engineer, Verizon Media</p><p>Do you have data in Apache Hadoop using Apache HDFS that is made available with Apache Hive? Do you spend too much time manually cleaning old data or maintaining multiple scripts? In this post, we will share why we created and open sourced the <a href="https://github.com/yahoo/data-disposal">Data Disposal tool</a>, as well as how you can use it.</p><p>Data retention is the process of keeping useful data and deleting data that should no longer be stored. Why delete data? It could be too old, consume too much space, or be subject to legal requirements that mandate purging it within a certain period after acquisition.</p><p>Retention tools generally handle deleting data entities (such as files, partitions, etc.) based on three attributes: duration, granularity, and date format.</p><ol><li><b>Duration:</b> The length of time before the current date. For example, 1 week, 1 month, etc.</li><li><b>Granularity:</b> The frequency at which the entity is generated. For example, a dataset may generate new content every hour and store it in a directory partitioned by date.</li><li><b>Date Format:</b> Data is generally partitioned by date, so the date format must be known in order to find all relevant entities.</li></ol><p><b>Introducing Data Disposal</b></p><p>We found that many of the existing tools we looked at lacked critical features we needed, such as a configurable date format for parsing dates from the data's directory path or partition, and an extensible code base for meeting current as well as future requirements. Each tool was also built for retention with a specific system like Apache Hive or Apache HDFS rather than providing a generic tool.
This inspired us to create Data Disposal.</p><p>The Data Disposal tool currently supports the two main use cases discussed below, but the interface is extensible to any other data store in your environment.</p><ol><li>File retention on Apache HDFS.</li><li>Partition retention on Apache Hive tables.</li></ol><p><b>Disposal Process</b></p><figure data-orig-width="1256" data-orig-height="266" class="tmblr-full"><img src="https://66.media.tumblr.com/2ca7584c2c0ee20383c9a881907f8959/2cbee9fdee01a12d-45/s540x810/7330663d8f9d20168de6f08fdb3599afa48b9b96.png" alt="image" data-orig-width="1256" data-orig-height="266"/></figure><p>The basic disposal process has three steps:</p><ul><li>Read the provided YAML config files.</li><li>Run Apache Hive disposal for all Hive config entries.</li><li>Run Apache HDFS disposal for all HDFS config entries.</li></ul><p>The order of the disposals is significant: if Apache HDFS disposal ran first, queries to Apache Hive could return partitions whose underlying data is missing.</p><p><b>Key Features</b></p><p>The interface and functionality are implemented in Java using the Apache HDFS Java API and the Apache Hive HCatClient API.</p><ol><li>A YAML config provides a clean interface for creating and maintaining your retention process.</li><li>Flexible date formatting using Java’s SimpleDateFormat when the date is stored in an Apache HDFS file path or in an Apache Hive partition key.</li><li>Flexible granularity using Java’s ChronoUnit.</li><li>The ability to schedule runs with your preferred scheduler.</li></ol><p>Our current use cases all use <a href="https://screwdriver.cd/">Screwdriver</a>, an open source build platform designed for continuous delivery, but other schedulers such as cron, Apache Oozie, or Apache Airflow work just as well.</p><p><b>Future Enhancements</b></p><p>We look forward to making the
following enhancements:</p><ol><li>Retention for other data stores, based on your requirements.</li><li>Support for file retention when configuring Apache Hive retention on external tables.</li><li>Any other requirements you may have.</li></ol><p>Contributions are welcome! The Data team, located in Champaign, Illinois, is always excited to accept external contributions. Please <a href="https://github.com/yahoo/data-disposal/issues/new">file an issue</a> to discuss your requirements.</p>
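<p>To make the retention attributes concrete, here is a minimal sketch of how duration, granularity, and date format can combine into an expiration check using Java’s SimpleDateFormat and ChronoUnit, the same building blocks the tool uses. The class and method names below are purely illustrative and are not part of the Data Disposal API.</p>

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.temporal.ChronoUnit;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

// Hypothetical sketch -- class and method names are illustrative,
// not the actual Data Disposal API.
public class RetentionCheck {

    /**
     * Decides whether a date-partitioned entity is past its retention window.
     *
     * @param datePart    the date string extracted from the file path or partition key
     * @param datePattern a SimpleDateFormat pattern describing that string
     * @param duration    the retention length, in units of the given granularity
     * @param granularity the unit of the retention length (e.g. DAYS, HOURS)
     * @param now         the reference time, normally the current time
     */
    public static boolean isExpired(String datePart, String datePattern,
                                    long duration, ChronoUnit granularity, Date now) {
        Date partitionDate;
        try {
            partitionDate = new SimpleDateFormat(datePattern).parse(datePart);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + datePart, e);
        }
        // Cutoff = now minus (duration * granularity); anything older is disposable.
        long cutoffMillis = now.getTime()
                - granularity.getDuration().multipliedBy(duration).toMillis();
        return partitionDate.getTime() < cutoffMillis;
    }

    public static void main(String[] args) {
        // Keep 30 days of daily partitions, evaluated as of 2019-08-01.
        Date now = new GregorianCalendar(2019, Calendar.AUGUST, 1).getTime();
        System.out.println(isExpired("20190601", "yyyyMMdd", 30, ChronoUnit.DAYS, now)); // true
        System.out.println(isExpired("20190725", "yyyyMMdd", 30, ChronoUnit.DAYS, now)); // false
    }
}
```

<p>A real retention run would apply a check like this to every partition of every configured entity and then delete the expired ones through the HDFS or HCatClient APIs.</p>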