Tap Into the Insights of Big Data: A Q&A Spotlight with Josh Rogers of Syncsort

Originally published 28 December 2011

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK present the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Josh Rogers, Vice President of Worldwide Field Sales for Syncsort. Ron and Josh talk about how organizations can achieve greater performance and greater throughput in their integration architectures for data warehousing and big data.

Josh, enterprises of all sizes today are struggling with how they can scale their data integration environment to tap into the insights within “big data.” What are you hearing from your customers and prospective customers regarding this scale issue for big data?

Josh Rogers: We see customers really struggling with this challenge, and I think it's becoming very much a mandate from a business user perspective, and the IT group is trying to figure out how to tackle the challenge. We see two basic components. The first is simply one of scale. How do they look at their current architecture and figure out how they scale that to 10, 50 or 100 times the data volumes from an integration perspective? That presents a technical challenge that I think requires new architectural approaches. But the second piece, and I think one that people don't necessarily think through as carefully as they probably should, is how they do that in a cost-effective way. We see organizations that frankly don't have a good sense necessarily of what their cost structure is to integrate day to day, much less how they're going to scale that environment in a cost-effective way. Syncsort offers unique solutions to help people achieve that scale. Today, we see people relying on an endless labor-intensive tuning approach, and we see people trying to throw hardware at the problem and getting diminishing returns. We believe we have some unique approaches that allow people to attack not just the ability to scale to the volumes they need, but also an ability to do that in a cost-effective way.

That certainly confirms what we hear from our readers about how this huge influx of data from many different sources is really increasing their workload and significantly impacting performance and scalability. How can Syncsort help with these problems?

Josh Rogers: Well, I think that breaks down the problem in a new dimension that you raised in your first question. Not only is it that they need to be able to integrate and process larger data volumes, but they also have to do that across a wider variety of data sources.

These are data sources that come from both inside and outside of the enterprise. What Syncsort offers is a broad set of connectivity and, more importantly, an architectural approach that allows our solution to do a lot of that work. So as data volumes and data types change, we achieve performance through what we call self-tuning. We are able to actually tune our jobs and our approaches to continue to increase performance and throughput without the developers having to manually configure those performance settings. That's a big differentiator that we bring to the data integration landscape, and we think it's absolutely critical to take advantage of the insights of big data where your sources are always going to be changing and your volumes are always going to be growing.

In the early days of data warehousing, ETL – extract, transform, and load – was the main approach to data integration. Then a few years ago, some organizations started with an ELT approach – extract, load and then transform. This shifted the processing load to the database to take advantage of unused capacity, but obviously there isn’t a lot of unused capacity anymore and data volumes are increasing dramatically. Syncsort is now focusing on what you call ETL 2.0. Can you explain ETL 2.0 and its benefits?

Josh Rogers: I think you're exactly right that in order to achieve greater performance and greater throughput in their integration architectures, companies have looked at the spare database capacity they used to have and started to point that toward core data integration paths. Then when you look at the vendor landscape in the data integration world, you see that vendors have been expanding functionality, but not necessarily focusing on increasing the performance and the scale of the core ETL engine. What we're suggesting is that as data volumes have grown, ELT implementations have become very expensive for customers, and they have started to crowd out the day job of the database. We're not here to say that ELT is something that should never be done. But we are here to say that there is a pendulum, and we think that the pendulum has swung too far in the direction of ELT. People now have databases that users are querying where 40% to 60% of the database capacity is actually being put toward data integration type workflow versus user query.

We believe that if you can offer a scalable cost-effective core ETL engine to perform ETL at scale for big data volumes, that's an important piece of the information management architecture that can help you attack the challenges of scale in a big data environment.

So, ETL 2.0 is really about how to provide a next-generation core data processing engine in the ETL layer that can scale to the terabytes people need, and do it in a cost-effective way both from a resource perspective as it relates to hardware, memory and CPU, and also from a labor perspective. How do companies get that performance without having to do constant manual care and feeding of that ETL environment to achieve performance?

Syncsort offers a unique approach for this, and we built upon that approach in our recent DMExpress 7.0 release where we added components within that solution that are enabling us to allow end users to tap into the insight of big data, and do it at a reasonable total cost of ownership.

DMExpress 7.0 builds upon your unique approach to ETL and high performance. Could you give us some of the highlights in that new release?

Josh Rogers: We made some big investments around continuing our performance leadership. We improved our throughput about 30% for large-scale joins that happen in data integration environments. We made investments to support additional functionality around application modernization efforts in terms of allowing customers to bring data from legacy environments into an open systems environment.

We also announced our Hadoop edition, and this is a version of DMExpress that will actually run on top of Hadoop and allow customers to bring data into Hadoop as well as take data out of Hadoop environments. It also gives them a graphical user interface to build processing logic for Hadoop versus writing MapReduce. And finally, it will actually increase the performance and efficiency of each node in a Hadoop cluster, which we believe is pretty unique. When you look at the breadth of the release, you'll see that we're adding capability that allows people to take advantage of legacy data assets they have in the form of application modernization, and also bring those assets into Hadoop environments for large-scale data processing.

We've also put significant investment into continuing to improve the performance of the core engine. If you look at those things, those are pretty unique in the data integration industry.

In conjunction with big data and Hadoop, another major area of interest for our audience is the cloud. What does Syncsort offer for organizations that are combining their Hadoop initiatives with the cloud?

Josh Rogers: We believe that Hadoop is going to be a core component of information management architectures going forward. I recently had the opportunity to be at Hadoop World, and I don't think there's any question that that's the case. In fact, one of the reasons I believe that's the case is because it's a key enabling architecture for allowing large-scale cost-effective data processing from a data integration perspective and also from a query perspective. Our focus is really to allow people to leverage that core capability and not only be able to move data in and out, but also reflect the business logic they want to process. And finally, allow them to take that already scalable environment and improve performance on a permanent basis, which will continue to drive down the cost structure of tapping into the insights of big data.

Very good. One thing we haven’t talked about is your customer base. You have some very large Fortune 500 customers. Is that your typical customer?

Josh Rogers: That's a great question. I think one of the things that people need to understand is the claims that we're making around performance come from a long history of doing this with the largest companies in the world. Syncsort was founded in 1968, obviously initially focused on mainframe solutions. Our core architectural principle was to allow people to process large volumes of data and do it in a very resource-efficient way – both from a hardware and a labor perspective. We carry that initial mission of the late ’60s and early ’70s all the way into our solution today that we are now deploying on top of Hadoop environments.

If you look at our customer base, we have about 4,000 customers across our data integration business, and most of those are large enterprises. What we do see as a growing and vibrant space is medium-size organizations that are laser focused on tapping into the value of big data. A great example of that is comScore. They provide media metrics for the web, and they process over 50 terabytes a month leveraging DMExpress. They provide a really good example of the types of business models and value that can be created from big data when they use a core processing engine like DMExpress that can help support that in a cost-effective manner.

My next question is along the lines of data integration. We hear a lot about data virtualization and federation. Many people assume there's a diminishing need for data integration because there now are other ways to get data from various source systems and make it available to users and applications. Do you see data integration as a continuing enterprise need?

Josh Rogers: I think that businesses in general understand that it's a strategic imperative to be able to take advantage of the insights that they can glean from their key asset – the data that they're generating on a daily basis. I think as they look to take advantage of that, you're going to see new styles of data integration that will be critical to enabling certain pieces of strategies. But what I will say, and you can look at industry reports, is that there's a tremendous number of integrations that have been built and that serve their end users well, but they need to be scaled in a cost-effective manner. I think that virtualization and federation can play a role in meeting certain needs in certain use cases. But when you think about the core infrastructure and the investments that people have made in delivering a set of integrations and now they want to scale those, I think data integration is here to stay. The question is how can they plug something into that environment that helps scale it in a cost-effective manner and do it at scale both from a query and a processing perspective. I think that's where ETL and ETL 2.0 can play a really unique role. That's not to say that virtualization and federation aren't necessarily powerful tools to be used in the right way, but I don't think that one is going to replace the other.

Josh, your point earlier about not doing all of the transformations in the database is really key because the volumes are getting such that people can't afford to continue to expand these databases and still get the performance they need. I think that's a really good point you made.

Josh Rogers: Well, I appreciate that. I think Hadoop is a good example of people understanding that they need to find another area and a lower cost structure to be able to perform these integrations. What we're excited about and why we're putting a lot of R&D effort into the Hadoop space is that we can help build out the full value proposition that Hadoop begins to promise and make that real for customers. In fact, we're doing that, and we look forward to working with customers and prospects in the coming year to help them take advantage of architectures like Hadoop as well as, obviously, our core DMExpress offerings.

Well, that sounds great, Josh. Thanks for taking the time to update BeyeNETWORK readers about Syncsort, DMExpress and what's happening in the big data world.


  • Ron Powell
    Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com.

    More articles and Ron's blog can be found in his BeyeNETWORK expert channel.
