DataEngConf NYC - Tying It All Together
From Jason at Ideal Prediction!
DataEngConf held its first NY conference this fall with the goal of bringing Data Engineers and Data Scientists together. There wasn’t a unifying theme or tagline, but two ideas kept coming up for me:
- Making it easier for data to be exchanged and analyzed
- Broadening the use cases for tools that were originally built around specific niches
We’re a team that works with both R and Pandas, and our clients use one or both, so interoperability is huge for us. In particular, we’re really excited about the Apache Arrow project, which aims to be a high-performance, language-independent columnar in-memory format for exchanging data between systems. When our data pipeline creates intermediate files that need to be passed between platforms or languages, around 80% of our run time goes to reading and writing CSVs.
Apache Arrow aims to cut the serialization/deserialization step out of the flow, which could, with no other changes in our code, allow us to speed up data transformation roughly 4x. While Apache Arrow is still a ways away from being broadly implemented, the R and Pandas communities have already collaborated on an alpha implementation called Feather, a file format for passing data frames between the two.
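To sanity-check that estimate, here is a minimal sketch of the arithmetic using Amdahl's law. The 80% figure comes from our pipeline as described above; the function name `overall_speedup` and the candidate I/O speedups in the loop are hypothetical values chosen for illustration, not measurements.

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the I/O share of run time gets faster.

    io_fraction: share of total run time spent on reading/writing (0..1).
    io_speedup:  how many times faster that share becomes (float("inf") = eliminated).
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    # Time left over = untouched work + the I/O work at its new, faster rate.
    remaining = (1.0 - io_fraction) + io_fraction / io_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Illustrative I/O speedups only; 0.8 is the CSV share quoted in the post.
    for s in (4, 16, float("inf")):
        print(f"I/O {s}x faster -> pipeline {overall_speedup(0.8, s):.2f}x faster")
```

With 80% of run time in CSV I/O, the ceiling is 5x if that cost disappears entirely, and an overall 4x corresponds to the I/O step itself getting about 16x faster, so "roughly 4x" leaves some room for overhead that Arrow would not remove.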
Other things we were excited to see:
- The Hadoop ecosystem seems increasingly focused on ingesting messier and more varied data sources
- Kafka has added a third client, Kafka Streams, a set of lightweight libraries that extend the basic producer and consumer clients without requiring the much more complex architectures of Spark or Storm
On a less technical note, the idea of making data and data exploration more accessible to everyone was pervasive at the conference. Several Data Engineers mentioned in their talks that one of their goals was not only to make it easier for Data Scientists to run more complex experiments and iterate faster, but also to allow people in roles not traditionally considered Data Science to explore the data and feel empowered to form insights on their own. We couldn’t agree more: making data easy to explore and interact with is one of our core focuses here at Ideal Prediction.