By Irem Radzik
Apache Kudu was designed specifically for use cases that require low-latency analytics on rapidly changing data, including time-series data, machine data, and data warehousing. Its architecture pairs rapid inserts and updates with column-based queries – enabling real-time analytics on a single, scalable, distributed storage layer.
However, the latency and relevance of the analytics are only as good as the latency of the data. To get the most out of Kudu's speed, you need to deliver data to it in real time, as soon as possible after it is created.
Here are the top 5 considerations when choosing a data integration solution for Kudu.
1. Does it support the data sources you need for the core use cases with real-time, continuous ingestion capabilities?
In order to provide holistic analysis of enterprise data, it is crucial to be able to ingest all relevant information, regardless of the source. If there are gaps in the data sources or data types your integration software supports, you may be limited in the content that can be fed into Kudu. This can lead to partial, or even misleading, insights from your reports or analytics applications.
Having a real-time ETL solution that can ingest a wide range of dynamically changing data is a critical step in gaining value from your fast analytics platform. Whether the data comes from databases, machine logs, applications, or IoT devices, it needs to be collected in real time, within microseconds to milliseconds of its creation. This means using the right technology and techniques to achieve very low-latency, continuous data collection, independent of the source.
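As a rough illustration (not any particular vendor's implementation), continuous low-latency collection often comes down to micro-batching under a strict latency budget: events are buffered only until either a small batch fills or a time bound elapses, whichever comes first. The sketch below uses invented names and parameters:

```python
import time
from typing import Any, Callable, List


class MicroBatcher:
    """Buffers incoming events and flushes them downstream as soon as
    either a small batch fills up or a latency budget (seconds) elapses,
    keeping end-to-end delay bounded."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 max_batch: int = 100, max_latency_s: float = 0.05):
        self.flush = flush                # delivery callback (e.g. a Kudu writer)
        self.max_batch = max_batch
        self.max_latency_s = max_latency_s
        self.buffer: List[Any] = []
        self.oldest = None                # arrival time of oldest buffered event

    def add(self, event: Any) -> None:
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(event)
        # Note: in this sketch the latency bound is only checked when a new
        # event arrives; a real collector would also flush from a timer thread.
        if (len(self.buffer) >= self.max_batch or
                time.monotonic() - self.oldest >= self.max_latency_s):
            self._flush()

    def _flush(self) -> None:
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
```

A size bound keeps writes efficient; the time bound guarantees that even a trickle of events reaches the store within the latency budget.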
2. Does it support your existing schemas, especially for the reports that you want to keep working while allowing for new use cases without extensive coding?
Real-time ETL is not only about ingestion performance and the resulting low latency, but also about processing speed and flexibility. A solution that can efficiently perform a wide range of transformations, filtering, aggregation, masking, and enrichment to prepare the data for your existing schemas will enable you to continue supporting your current end users while adding new applications.
Especially for high-velocity and sensitive data, these data preparation steps need to take place before the data is delivered to your analytics environment so you can avoid introducing latency, optimize storage in Kudu, reduce on-disk processing workload, and fully comply with government regulations.
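To make the idea concrete, here is a minimal sketch of in-flight preparation; the record fields (`status`, `card_number`, `store_id`) and the lookup table are hypothetical, chosen only to show filtering, masking, and enrichment happening before delivery:

```python
def prepare(record: dict, region_lookup: dict):
    """Prepare one record in flight: drop noise, mask sensitive
    fields, and enrich with reference data before it lands in the
    analytics store. Returns None for records that should be dropped."""
    if record.get("status") == "heartbeat":          # filter out noise
        return None
    out = dict(record)
    if "card_number" in out:                         # mask PII for compliance
        out["card_number"] = "****" + out["card_number"][-4:]
    # enrich with a region looked up from reference data
    out["region"] = region_lookup.get(out.get("store_id"), "unknown")
    return out
```

Because masking happens before delivery, the sensitive value never reaches the analytics environment at all, which is usually simpler than scrubbing it after the fact.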
Not all use cases need millisecond speeds, but once latency stretches beyond a minute, many use cases lose the chance of producing actionable insights from high-velocity data. Kudu is designed for very low-latency use cases. Data is only available for querying after it has been processed and written to Kudu, so the ETL process should add as little latency as possible between creation and delivery.
When real-time ETL for Kudu performs this processing in memory, while the data is in motion, it can scale to handle large volumes without introducing latency and will accelerate time to insight. It also simplifies the overall data architecture, enabling end-to-end recoverability and full resiliency.
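A common form of in-memory, in-motion processing is windowed aggregation: state lives only in memory while events stream past, and nothing intermediate is written to disk. A minimal sketch (tumbling windows, simple counts, invented field layout):

```python
from collections import defaultdict


def tumbling_window_counts(events, window_s: int = 60):
    """Count events per (window, key) entirely in memory.
    `events` is an iterable of (timestamp_seconds, key) pairs;
    each event is assigned to the tumbling window containing it."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_s     # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)
```

A production engine would additionally evict and emit windows as they close, so results reach Kudu as soon as each window is complete.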
3. Is it secure and reliable, including exactly-once processing (E1P) and delivery?
Security is always a key consideration and becomes more problematic if the real-time ETL involves multiple products. Enabling end-to-end security across various components can require a lot of effort to reliably meet your strict data security requirements. Choosing a real-time ETL solution with built-in end-to-end security saves you from significant risks and costs.
The same applies to reliability. Can you trust the insight from your fast analytics if the processing or delivery contains duplicates or misses data? Mission-critical use cases require full trust in the data you put in.
As we all know: garbage in, garbage out. Guaranteeing E1P becomes much harder when time windows are involved, for example in aggregations. That’s why choosing a platform that automatically recovers after an outage, without manual intervention, will save you development and maintenance costs, as well as prevent inaccurate conclusions and actions.
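One common building block behind exactly-once delivery (though not the only one) is an idempotent sink: upstream components are allowed to replay events after a failure, and the sink deduplicates by a unique event id so replays never double-count. A minimal sketch with invented names:

```python
class IdempotentSink:
    """Deduplicating sink: upstream may redeliver events after an
    outage (at-least-once), but each event id is applied only once,
    giving exactly-once results downstream."""

    def __init__(self):
        # In a real system the seen-set (or a high-water mark) is
        # persisted atomically alongside the data, not held in memory.
        self.seen = set()
        self.delivered = []

    def apply(self, event_id, payload) -> bool:
        if event_id in self.seen:
            return False          # duplicate replay: skip silently
        self.seen.add(event_id)
        self.delivered.append(payload)
        return True
```

With this in place, recovery after an outage is just "replay from the last checkpoint"; the sink absorbs the duplicates.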
4. Is the coding language accessible to all the groups that need to be involved?
The longevity of any solution is also dependent on how widely it is used and accepted within the organization. When a data integration tool is easy to understand and easy to use for different groups, especially for the business teams, its adoption will likely be faster.
While Java, Scala, and other coding languages are popular for analytics solutions, using a SQL-based language to process the data will open the solution to a much larger set of users. It also reduces the burden of maintaining harder-to-find skill sets to support the solution. Pairing an intuitive UI for end users with command-line options for power users will further increase development productivity across this broad user group.
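The appeal of SQL-based processing is that the transformation logic reads like a query anyone on the business side can follow. In the toy sketch below, an in-memory SQLite table stands in for a streaming SQL engine processing one micro-batch (table and column names are invented):

```python
import sqlite3


def sql_transform(rows):
    """Run a filter-and-aggregate transformation expressed in SQL over
    one micro-batch of (user_id, amount) rows. SQLite here is only a
    stand-in for a streaming SQL engine."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE clicks (user_id TEXT, amount REAL)")
    con.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT user_id, SUM(amount) FROM clicks "
        "WHERE amount > 0 GROUP BY user_id ORDER BY user_id")
    return cur.fetchall()
```

The same logic in Java or Scala would be dozens of lines of code; in SQL it is one statement a business analyst can review and change.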
5. Does it provide you the flexibility to move the data to new targets, especially in the cloud?
Technology requirements change. While you might have a specific endpoint like Apache Kudu in mind today, new requirements may dictate other technologies in the future.
The flexibility to supply raw or processed data to other on-premises and cloud targets simplifies and future-proofs your data architecture. Furthermore, for many data sources, especially database change data capture, reading the same data multiple times can add unacceptable overhead to source systems.
When looking at real-time ETL solutions, you should consider whether data read from a single source can be delivered to multiple targets simultaneously. These targets could include Kudu, as well as Kafka for real-time data distribution, and cloud technologies for elastic scalability.
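The read-once, deliver-many pattern can be sketched in a few lines; the target names below (Kudu, Kafka, a cloud bucket) are illustrative, each represented by a simple delivery callback:

```python
def fan_out(source_events, targets: dict):
    """Read each event from the source exactly once and deliver it to
    every registered target, so the source (e.g. a database change log)
    is never re-read per destination."""
    for event in source_events:
        for name, deliver in targets.items():
            deliver(event)          # e.g. Kudu writer, Kafka producer
```

Each target keeps its own delivery semantics and failure handling; the key point is that the source-side read cost is paid once, no matter how many destinations are attached.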
When you feed pre-processed data from your high-velocity data sources to Kudu in real-time using a secure, reliable, and easy-to-use solution, you can gain the maximum benefit from your fast analytics applications with the least amount of effort.
About the author: Irem Radzik leads product marketing at Striim. Before working for Striim, Irem was the Director of Product Marketing for Oracle Cloud Integration product group. Irem has more than 18 years of product management and marketing experience in enterprise software, financial services and consulting industries—with a focus on data and application integration, and business analytics technologies. She joined Oracle with the acquisition of GoldenGate Software in 2009. Before GoldenGate Software, Irem worked at Siebel Systems (now part of Oracle), TIBCO Software and Enkata Technologies. She holds an M.B.A. degree from the University of Pennsylvania, Wharton School of Business.
The post 5 Key Data Integration Considerations for Apache Kudu appeared first on Datanami.
Read more here: www.datanami.com/feed/
Knowledge is good.
Fellow fans of Animal House will recognize this as the delightfully silly and subversive message etched into the base of a statue at the fictional Faber College.
Silly or not, it’s also true. And in the knowledge-imparting department, Apple and Netflix are moving in opposite directions. Apple last year said it will stop giving out iPhone unit sales data. Investors took that as a bad sign of things to come–and were right. Thursday, Netflix accelerated its very recent practice of disclosing viewership data.
Investors were annoyed for other reasons, primarily slower-than-expected revenue growth and forecasts. But there are all sorts of reasons Netflix’s sharing is a happy development. Investors and partners will be glad finally to know how Netflix shows perform. Netflix had been stingy with data on the theory that because it sells no advertising it didn’t have to prove viewership to anyone. It already divulged paid customer counts, of course. That number is nearly 140 million worldwide.
Despite the swoon in Netflix’s stock–that happens frequently–there’s another reason for optimism about the company. Analyst Mark Mahaney reckons it is the least exposed FANG (Facebook, Amazon, Netflix, Google) company to regulatory risk. “Since it’s not an ad-supported business, Netflix may have much less reliance on personally identifiable information, a hot political issue these days,” Mahaney wrote to clients Thursday night. “Netflix is arguably more media- than tech-platform, and thus might avoid the current scrutiny that the latter are facing. And the CEO doesn’t personally own an influential newspaper that expresses opinions independent from political leaders.”
Here are the 50 best workplaces in technology, according to a new Fortune ranking.
Have a good long weekend. Data Sheet returns Tuesday.
Everything is awesome when you’re part of a team. In addition to Netflix’s quarterly report that Adam discussed, collaboration software maker Atlassian pleased Wall Street with its numbers. Revenue increased 39% to $299 million and adjusted earnings per share almost doubled to 25 cents. Both were better than analysts expected and Atlassian’s stock price, already up 76% over the past year, jumped another 8% in premarket trading on Friday.
We are from the planet Duplo. Following news last Friday that Elon Musk’s SpaceX was laying off 10% of its workforce comes news this Friday that Elon Musk’s Tesla is laying off 7% of its employees. In a memo to staff, Musk said the cuts were necessary as the electric carmaker moved to sell cheaper versions of its Model 3. “We need to reach more customers who can afford our vehicles,” Musk wrote. Tesla shares, up just 2% over the past year, slid 7% in premarket trading.
You don’t have to be the bad guy. Security experts discovered one of the largest caches of stolen online data in history. Dubbed “Collection #1,” the data includes 772,904,991 different email addresses and 21,222,975 unique passwords.
I think I just heard a whoosh. In a widely misunderstood story, watchmaker Fossil sold some secret new smartwatch tech to Google for $40 million. But Fossil will continue making its own lines of smartwatches and Google says it will make the new feature(s) available to all companies using its Wear OS platform. Thus the deal does not appear to be the basis for the elusive, imagined “Pixel watch” many Google fans crave.
Not so special anymore. Some Facebook employees got caught leaving 5-star reviews on Amazon for the social network’s Portal smart display. Oops. Facebook said the reviews were “neither coordinated nor directed from the company” and the employees would be asked to take them down.
FOR YOUR WEEKEND READING PLEASURE
The Most Powerful Person in Silicon Valley (Fast Company)
Billionaire Masayoshi Son–not Elon Musk, Jeff Bezos, or Mark Zuckerberg–has the most audacious vision for an A.I.-powered utopia where machines control how we live. And he’s spending hundreds of billions of dollars to realize it. Are you ready to live in Masa World?
Jack Dorsey Has No Clue What He Wants (Huffington Post)
A Q&A with Twitter’s CEO on right-wing extremism, Candace Owens, and what he’d do if the president called on his followers to murder journalists.
Lunch With the FT: Meg Whitman: ‘Businesses Need To Think, Who’s Coming To Kill Me?’ (Financial Times)
The tech executive on surviving disruption, politics as combat, and taking on Netflix.
The Man Behind Billionaires’ Row Battles to Sell the World’s Tallest Condo (Wall Street Journal)
Extell’s Gary Barnett remade Manhattan’s skyline and spurred a supertall-tower boom with One57. In a faltering real-estate market, he’s hoping to sell the ultra-rich on Central Park Tower.
FOOD FOR THOUGHT
We’ve been hearing for a while that the convergence of clever urban planning and the Internet of Things would bring about a new living space dubbed the smart city. Progress has been made in fits and starts. Tekla Perry reports for IEEE Spectrum on how San Diego is doing after installing thousands of data collection points on street lamps. So far, the city is simply counting vehicles and people as they pass by each sensor. Erik Caldwell, who holds the mouthful of a title “interim deputy chief operating officer for smart and sustainable communities,” explains what may happen next:
“It’s not super-exciting yet in terms of applications from the outside looking in,” Caldwell says. “But it’s like we asked for a cold drink of water and got shot in the face with a firehose; it’s a matter of figuring out how we are going to take in this data, first using it internally, and then putting in policies and procedures to make it available for use by the public, including application developers.”
IN CASE YOU MISSED IT
Hate Assembling IKEA Furniture? This Robot Might Be Able to Help, by Erin Corbett
BEFORE YOU GO
I don’t mean to end your Friday on a downer, but one of the plant species most at risk from global climate change is coffee. We may end up having to shift consumption to obscure beans from unusual breeds of coffee plants that have thrived in difficult conditions–if we can produce enough of them. So extra-savor that cup of joe over the long weekend, and we’ll see you back here on Tuesday.
Read more here: fortune.com/tech/feed/
By A.R. Guess
According to a new press release, “Aerohive Networks™, a Cloud-Management leader, today announced the availability of cloud management for its A3 Secure Access Management solution. A3 brings a comprehensive, yet simplified approach to Corporate, BYOD, Guest, and IoT client device onboarding, authentication, and network access control (NAC). First launched in May 2018 as an on-premises […]
The post Aerohive Announces Microservices-based Cloud Management appeared first on DATAVERSITY.
Read more here: www.dataversity.net/feed/