AWS brews the glue to prepare data

  • November 18, 2020
  • Steve Rogerson

Amazon Web Services has announced the general availability of AWS Glue DataBrew, a visual data preparation tool that lets users clean and normalise data without writing code.

Since 2016, data engineers have used AWS Glue to create, run and monitor extract, transform and load (ETL) jobs. AWS Glue provides both code-based and visual interfaces, and has simplified extracting, orchestrating and loading data in the cloud.

Data analysts and data scientists have wanted an easier way to clean and transform these data, and that’s what DataBrew delivers, with a service that allows data exploration and experimentation directly from AWS data lakes, data warehouses and databases without writing code.

DataBrew provides more than 250 pre-built transformations to automate data preparation tasks – such as filtering anomalies, standardising formats and correcting invalid values – that would otherwise require days or weeks of hand-coded transformations. Once the data are prepared, users can immediately start using them with AWS and third-party analytics and machine-learning services to query the data and train machine-learning models.

There are no upfront commitments or costs to use DataBrew, and users only pay for creating and running transformations on datasets.

Preparing data for analytics and machine learning involves several necessary and time-consuming tasks, including data extraction, cleaning, normalisation, loading and the orchestration of ETL workflows at scale. For extracting, orchestrating and loading data at scale, data engineers and ETL developers skilled in SQL or programming languages such as Python or Scala can use AWS Glue.

ETL developers often prefer the visual interfaces common in modern ETL tools over writing SQL, Python or Scala, so AWS recently introduced Glue Studio, a visual interface to help author, run and monitor ETL jobs without having to write any code. Once the data have been reliably moved, they still need to be cleaned and normalised by the data analysts and scientists who operate in the lines of business and understand the context of the data.

To clean and normalise the data, data analysts and scientists must either work with small batches of the data in Excel or Jupyter notebooks, which cannot accommodate large data sets, or rely on scarce data engineers and ETL developers to write custom code to perform cleaning and normalisation transformations.

In an effort to spot anomalies in the data, skilled data engineers and ETL developers spend days or weeks writing custom workflows to pull data from different sources, then pivot, transpose and slice the data multiple times before they can iterate with data analysts or scientists to identify and fix data quality issues. After they have developed these transformations, data engineers and ETL developers still need to schedule the custom workflows to run on an ongoing basis, so new incoming data can automatically be cleaned and normalised.

Each time a data analyst or scientist wants to change or add a transformation, the data engineers and ETL developers need to repeat the extraction, loading, cleaning, normalisation and orchestration of the data preparation tasks. This iterative process can take several weeks or months to complete and, as a result, users spend as much as 80% of their time cleaning and normalising data instead of actually analysing the data and extracting value from them.

DataBrew is a visual data preparation tool for AWS Glue that allows data analysts and data scientists to clean and transform data with an interactive, point-and-click visual interface, without writing any code. With DataBrew, end users can access and visually explore any amount of data across their organisation directly from their Amazon Simple Storage Service (S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora and Amazon Relational Database Service (RDS) databases.
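The announcement itself contains no code, but the first step of that workflow – pointing DataBrew at data that already sits in S3 – can be sketched with the AWS SDK for Python (boto3). The region, bucket, prefix and dataset names below are illustrative placeholders, not values from AWS.

    # Minimal sketch: register an S3 location as a DataBrew dataset with boto3.
    # Region, bucket, key and dataset names are placeholder assumptions.
    import boto3

    databrew = boto3.client("databrew", region_name="eu-west-1")

    databrew.create_dataset(
        Name="sales-raw",                       # hypothetical dataset name
        Format="CSV",
        Input={
            "S3InputDefinition": {
                "Bucket": "example-data-lake",  # placeholder S3 bucket
                "Key": "raw/sales/2020/",       # placeholder prefix of raw files
            }
        },
    )

Once registered in this way, the same dataset can be opened in the DataBrew console for the interactive, point-and-click exploration described above.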

Users can choose from more than 250 built-in functions to combine, pivot and transpose the data without writing code.

DataBrew recommends data cleaning and normalisation steps such as filtering anomalies, normalising data to standard date and time values, generating aggregates for analyses, and correcting invalid, misclassified or duplicative data.

For complex tasks such as converting words to a common base or root word (say converting “yearly” and “yearlong” to “year”), DataBrew also provides transformations that use machine-learning techniques such as natural language processing.

Users can save these cleaning and normalisation steps into a workflow called a recipe and apply them automatically to future incoming data. If changes need to be made to the workflow, data analysts and scientists simply update the cleaning and normalisation steps in the recipe, and they are automatically applied to new data as they arrive.

DataBrew publishes the prepared data to Amazon S3, making it easier for users to use them immediately in analytics and machine learning. DataBrew is serverless and fully managed, so users never need to configure, provision or manage any compute resources.
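To make the recipe idea concrete, the sketch below shows – again with boto3, and again as an illustration rather than anything from the announcement – how a couple of cleaning steps might be saved as a recipe, published, and then run as a recipe job that writes the prepared output back to S3. The operation names, step parameters, bucket names and IAM role ARN are assumptions; the actual step vocabulary is defined in the DataBrew recipe actions reference.

    # Minimal sketch: save cleaning steps as a recipe and run them as a job.
    # Operation names, parameters, bucket names and the role ARN are assumptions.
    import boto3

    databrew = boto3.client("databrew", region_name="eu-west-1")

    # A recipe is an ordered list of transformation steps.
    databrew.create_recipe(
        Name="sales-clean",                             # hypothetical recipe name
        Steps=[
            {"Action": {"Operation": "REMOVE_VALUES",   # assumed operation name
                        "Parameters": {"sourceColumn": "order_total"}}},
            {"Action": {"Operation": "UPPER_CASE",      # assumed operation name
                        "Parameters": {"sourceColumn": "country_code"}}},
        ],
    )
    databrew.publish_recipe(Name="sales-clean")         # freeze a version for jobs

    # A recipe job applies the recipe to a dataset and writes the results to S3,
    # so new incoming data can be cleaned without redoing the work by hand.
    databrew.create_recipe_job(
        Name="sales-clean-job",
        DatasetName="sales-raw",                        # dataset from earlier sketch
        RecipeReference={"Name": "sales-clean"},
        RoleArn="arn:aws:iam::123456789012:role/DataBrewJobRole",  # placeholder
        Outputs=[{"Location": {"Bucket": "example-data-lake",
                               "Key": "prepared/sales/"}}],
    )
    databrew.start_job_run(Name="sales-clean-job")

If the cleaning steps later change, updating and republishing the recipe and re-running the job picks up the new steps, which is the automation loop described above.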

“AWS customers are using data for analytics and machine learning at an unprecedented pace,” said Raju Gulabani, vice president at AWS. “However, these customers regularly tell us that their teams spend too much time on the undifferentiated, repetitive and mundane tasks associated with data preparation. Customers love the scalability and flexibility of code-based data preparation services like AWS Glue, but they could also benefit from allowing business users, data analysts and data scientists to visually explore and experiment with data independently, without writing code. AWS Glue DataBrew features an easy-to-use visual interface that helps data analysts and data scientists of all technical levels understand, combine, clean and transform data.”

One user is Tokyo-based NTT Docomo, the largest mobile service provider in Japan, serving more than 80 million customers.

“Our analysts profile and query various kinds of structured and unstructured data in order to better understand usage patterns,” said Takashi Ito, general manager of marketing at NTT Docomo. “AWS Glue DataBrew provides a visual interface that enables both our technical and non-technical users to analyse data quickly and easily. Its advanced data profiling capability helps us better understand our data and monitor the data quality. AWS Glue DataBrew and other AWS analytics services have allowed us to streamline our workflow and increase productivity.”