The professional world for data experts has been evolving for quite some time. However, in more recent years there has been a fundamental shift both in the technical landscape and the culture surrounding data professionals. These recent evolutions in data management have led to the creation of the field called data engineering. Before I delve into what data engineering looks like today, it helps to discuss how we got here.
Data Engineering: A Historical Perspective
Prior to the arrival of "big data" systems, data professionals typically carried job titles such as "Database Developer", "Database Administrator", "Data Architect", and "BI Developer". Keep in mind these jobs still exist today. I have personally carried all of these titles at one point in my career before building and selling my own systems integrator. Each of these job titles typically has some degree of overlap depending on the employing organization's IT culture. Below I briefly define each of these yesteryear roles:
- Database Developer: Focused on the development, maintenance, and tuning of database objects including ETL/ELT workflows when employed in OLAP/enterprise data warehouse scenarios.
- Database Administrator: Focused on the installation, configuration, and daily operations of database management systems.
- Data Architect: Focused on the logical and optionally physical implementation of data processing systems.
- BI Developer: Focused on the development and maintenance of enterprise data warehouses, reporting, and OLAP cubes.
You may notice the absence of the data scientist. That's because that role became prominent only in the past 10 years. It was the business analyst who did the bulk of the number crunching, mostly from a historical perspective (a.k.a. business intelligence). All of these roles generally worked on symmetric multiprocessing (SMP) database systems. Examples include PostgreSQL, MySQL, SQL Server, Oracle, and DB2. However, as massively parallel processing (MPP) databases became more prevalent in the 2000s, the above roles began working with ever-increasing data volumes. An example of an MPP database is Greenplum Database, an open source MPP data warehouse with a PostgreSQL heritage. Enterprises continued to invest more in their data processing systems with the aim of uncovering insights in these huge volumes of data that would give them a competitive advantage over rivals.
The Data Engineer is Born
While the roles I described earlier are still important to this day, they were not sufficient in a big data world. Data engineering was born with the mainstream adoption of “big data” architectures and systems – most notably Apache Hadoop. In addition, many other non-database frameworks emerged across the data processing and persistence landscape. Unlike the data professionals of the past, data engineers must have the ability to develop across a large variety of languages and processing frameworks.
The first formal data engineers were primarily focused on Apache Hadoop. They were paid to conduct data wrangling experiments using experimental software developed by some of the largest data crunching organizations such as Facebook, Google, and Yahoo. By 2010, mainstream enterprises were adopting Hadoop. The field of data engineering went from a small niche to a major trend in IT shops across the country. This was the first generation of formal data engineers.
Another important theme that drove the new field was the clear distinction between those who worked with data processing systems and those who derived advanced analytics from them. No longer was it enough to collate, cleanse, and display what had happened in the past. Companies needed the ability both to forecast what would happen (predictive insights) and to decide what action to take on such future events (prescriptive insights). While these advanced analytics were possible in the past, few had the necessary skills, and thus the data scientist was born. The data scientist provides the perfect complement to the data engineer.
Data Engineering: 2017 and Beyond
Today, the second generation of data engineers is firmly established. Modern-day data engineers have to be even more adaptable as the pace of technology continues to increase. Specialized data processing systems rule the industry landscape, and the ability to adapt quickly is a critical skill. In addition to specialized data processing systems, the rise of public, private, and hybrid clouds has created an ever-growing choice of data substrates.
Beyond adaptability, today’s data engineers are working on a second generation of big data architectures ranging from Internet of Things (IoT) to Logical Data Warehouses (LDW). IoT solutions present new challenges both in terms of data volume and velocity. A common example is the connected car, but there are numerous use cases across all industries for IoT. My team recently worked with one of the largest automotive insurance providers in the United States, building out parallel pipelines for real-time data ingest, persistence, and scoring. The business case is to assess driver risk accurately and adjust policy premiums precisely. Safe drivers get discounts while riskier drivers pay a penalty. This is a huge opportunity for auto insurance providers and for drivers.
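To make the ingest, persist, and score stages more concrete, here is a minimal Python sketch of a single pipeline. Everything in it is a hypothetical illustration: the `TelemetryEvent` fields, the speed and braking thresholds, and the premium rule are assumptions made up for this example and do not describe the insurer's actual model. It also runs sequentially in memory, whereas a production system would run many such pipelines in parallel against durable storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class TelemetryEvent:
    """One reading from a connected car (fields are illustrative assumptions)."""
    driver_id: str
    speed_kph: float
    hard_brake: bool


def score_event(event: TelemetryEvent) -> int:
    """Return risk points for a single event; thresholds are made up."""
    points = 0
    if event.speed_kph > 120:
        points += 2
    if event.hard_brake:
        points += 1
    return points


def run_pipeline(
    events: Iterable[TelemetryEvent],
) -> Tuple[List[TelemetryEvent], Dict[str, int]]:
    """Ingest each event, persist it, and accumulate a per-driver risk score."""
    store: List[TelemetryEvent] = []  # stand-in for a durable data store
    risk: Dict[str, int] = defaultdict(int)
    for event in events:                              # ingest
        store.append(event)                           # persist
        risk[event.driver_id] += score_event(event)   # score
    return store, dict(risk)


def premium_adjustment(risk_points: int) -> float:
    """Hypothetical rule: 10% discount at zero points, capped surcharge otherwise."""
    if risk_points == 0:
        return -0.10
    return min(0.05 * risk_points, 0.50)


events = [
    TelemetryEvent("driver-a", 95.0, False),
    TelemetryEvent("driver-b", 130.0, True),
    TelemetryEvent("driver-b", 90.0, True),
]
store, risk = run_pipeline(events)
adjustments = {driver: premium_adjustment(points) for driver, points in risk.items()}
```

The scoring function is kept separate from the loop so that the risk model can change without touching the ingest and persistence steps.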
The LDW architecture emerged because of the distributed nature of data sources today. It is simply impractical to bring all data into a centralized warehouse for analysis. In the LDW architecture, a data warehouse, such as Pivotal Greenplum, is configured to process data across a variety of backend storage systems – both on-premises and in the cloud. This is also sometimes called data virtualization or data federation.
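The federation idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Greenplum or any other product implements it: `LogicalWarehouse`, `register`, and `scan` are hypothetical names, and the two in-memory lists stand in for an on-premises store and a cloud store.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]


class LogicalWarehouse:
    """Toy federation layer: one query surface over several storage backends."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, name: str, reader: Callable[[], Iterable[Row]]) -> None:
        """Attach a backend; `reader` returns its rows when called."""
        self._sources[name] = reader

    def scan(self, predicate: Callable[[Row], bool]) -> Iterator[Row]:
        """Yield matching rows from every backend, tagged with their origin."""
        for name, reader in self._sources.items():
            for row in reader():
                if predicate(row):
                    yield {**row, "_source": name}


# Hypothetical backends: the data stays where it lives and is read at query time.
ldw = LogicalWarehouse()
ldw.register("on_prem", lambda: [{"order_id": 1, "amount": 25.0}])
ldw.register(
    "cloud",
    lambda: [{"order_id": 2, "amount": 5.0}, {"order_id": 3, "amount": 40.0}],
)

large_orders = list(ldw.scan(lambda row: row["amount"] > 10))
```

The point of the design is that the caller writes one query and never copies the underlying data into a central store first.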
One of our most advanced Pivotal Greenplum customers in the marketing analytics space uses Amazon S3 storage as a public cloud “landing zone”, whereby data is exported from a larger centralized on-premises Greenplum cluster and ingested into S3. The analytics company then spins up and spins down ephemeral, on-demand Pivotal Greenplum clusters in the cloud that can both read (process) the data and write the resulting analytics back to S3. The business case is that the company's customers (consumer brands) want their own “analytic sandboxes” to perform additional slicing and dicing beyond the standard set of analytics the company provides.
Additionally, “fast data” architectures are on the rise in the enterprise, and deploying and managing them is an increasingly important responsibility of the modern data engineer. Just as the SMP databases of the past were not designed for high-performance analytics, they were also not designed for applications that need sub-second responses at consumer-grade concurrency. New technologies, such as Pivotal GemFire, a purpose-built in-memory data grid, are emerging to support these use cases.
A next-generation digital automotive dealer provider chose Pivotal GemFire to empower its customers' online presence. The company onboards several new dealerships each day onto its Pivotal GemFire clusters. Dealership customers enjoy a fast and consistent experience while developers enjoy a powerful and flexible next-generation object store.
The future of data engineering is bright as corporations continue to invest in hybrid data architectures that span on-premises and the cloud for competitive advantage. Next-generation architectures such as IoT and LDW, powered by open source software, enable new capabilities such as acting on events in real time as they occur. And the work of data engineers empowers data scientists to glean both predictive analytics and prescriptive actions for the consuming enterprise, which makes the two roles highly complementary!
I am humbled and proud to work in Pivotal’s field-facing data team, which includes some of the industry’s brightest minds across both the data engineering and data science disciplines. We help our valued customers successfully design and implement Pivotal data solutions, architectures, and advanced analytics. We get to tackle some of the industry’s most challenging and rewarding use cases!
To learn more about the role of the data engineer in analytics and data science use cases, be sure to join us at the Pivotal Analytics Innovation Roadshow, coming to a city near you. Register today!