One of the most interesting problem areas has always been around market data. What are the prices? What were they? Who bought what? Who's selling and why? How does the back testing look?
As trading technology advances, the number of messages grows. As messages grow, volumes follow. Volumes lead to more quotes. Market segmentation encourages more quoting venues. The world of trading has become the world of the best information in the fastest time … to receive it and act upon it.
Today’s markets throw off an ever-increasing amount of data that always holds a relative advantage for those who can fully absorb, analyze, understand and mine it. As volatility spikes become the norm, trading segments seek to analyze more events and to compare venues, regions and assets to trade. As such, market data, and the normalization, storage and analysis of it, put more and more stress on the enterprise, not to mention on the integrity of the analytics.
The current state of the art is that data is brought to the analytics. Data is kept somewhere deep in an archive, a database or a storage unit. It is then extracted and brought to the analytics piecemeal to determine risk, build strategy, back-test the strategy, view compliance metrics and so on down to proper settlement.
Ideally, the faster and larger-capacity systems now available would help with growing data sets, but moving data across the enterprise continues to put so much strain and latency on the process that most data is actually discarded and new data used, thus sacrificing information for performance.
Further, storage capacities have to be built out to allow mining of the growing “big data,” with average costs well above $100,000 per terabyte. Then there’s the movement of data across the LAN and the WAN that adds pressure to the bottom line in cost, time delay and data integrity.
Those who are able to archive and replay today’s market prices and run analysis on that information can easily be looking at tens of millions annually to buy, store, recall, replay, move, distribute and test, with no guarantee of a profit. The tradeoff is to do it slower, take longer and analyze smaller sets, at the risk of being last in the marketplace.
Analytics in and of themselves are by definition increasingly complex sets of patterns and models used in all aspects of finance. It’s arguably the art of designing interfaces and winning computations that makes the difference between what is used successfully and what is not. Still, the sheer size and depth of the data can make that opportunity moot.
Sometimes the best analytics can be held hostage to technology bottlenecks that compromise the data’s integrity, accuracy and timely arrival. However, if the analytics reside where the data is stored, a plethora of problematic implications dissipate. The accuracy and integrity of the solution are thus sharpened. The need to query data becomes almost obsolete as the data provides live results with constant computations running.
The bottom line is that putting powerful predictive analytics, which require no programming or quantitative teams, on the desktops of decision makers creates results heretofore available only to the largest of companies. The advent of in-database analytics offers this at a fraction of the cost of traditional solutions. In fact, we are currently seeing a 10:1 return on investment for our clients.
Why move the data to the analytics when you can move the analytics to the data? Analyzing large volumes of data presents numerous challenges – it is time consuming, very expensive and requires management of complex technology infrastructure. In traditional approaches to analyzing data, end users must move data into memory for processing. This activity accounts for up to 75 percent of the cycle time and imposes severe constraints on delivery of results. In addition, the client or server where the processing is done must have enough memory to store the data and intermediate results.
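To make the contrast concrete, here is a minimal sketch, not taken from any vendor's product. It assumes a hypothetical `ticks` table and uses Python's built-in `sqlite3` module as a stand-in for a real data warehouse. Both functions compute a volume-weighted average price per symbol; the first pulls every row out to the client before computing, while the second lets the database engine do the work and returns only the result.

```python
import sqlite3

# Hypothetical tick table; sqlite3 is only a stand-in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, price REAL, size INTEGER)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?)",
    [("ABC", 10.0, 100), ("ABC", 11.0, 300),
     ("XYZ", 20.0, 50), ("XYZ", 22.0, 150)],
)


def vwap_client_side(conn):
    """Move the data to the analytics: fetch every row, then compute."""
    rows = conn.execute("SELECT symbol, price, size FROM ticks").fetchall()
    notional, volume = {}, {}
    for symbol, price, size in rows:
        notional[symbol] = notional.get(symbol, 0.0) + price * size
        volume[symbol] = volume.get(symbol, 0) + size
    vwap = {s: notional[s] / volume[s] for s in notional}
    return vwap, len(rows)  # second value: rows that crossed the wire


def vwap_in_database(conn):
    """Move the analytics to the data: one result row per symbol comes back."""
    rows = conn.execute(
        "SELECT symbol, SUM(price * size) / SUM(size) "
        "FROM ticks GROUP BY symbol"
    ).fetchall()
    return dict(rows), len(rows)


client_result, rows_moved_client = vwap_client_side(conn)
db_result, rows_moved_db = vwap_in_database(conn)
```

Both paths give the same answer, but the client-side version moves every tick while the in-database version moves one row per symbol. On a toy table the difference is trivial; the argument in the text is that at terabyte scale this movement dominates the cycle time.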
Fuzzy Logix was founded by investment bankers who had constructed highly functional, performance-driven analytics for a variety of asset classes and risk measurements. They started the company for others facing similar data-management and analytical bottlenecks.
In these efforts a series of algorithms, functions and computations were compiled as a library called DB Lytix.
The joint efforts by database, data warehousing and business intelligence technology are an example of the two worlds – business and technology – coming closer together. Some estimates indicate predictive analytics will account for 30 percent of all analytics by 2014. Of course these are, well, predictions, but as long as the data continues to compile together with the analytics, there are no limitations.
4 Comments to "No More Limitations: Data Analytics":
louislovas
23 March 2011
Nice article to define a problem space. The marriage of analytics and data is at the heart of the technology from OneMarketData. Putting the analytics as 'close to the metal' as possible reduces latency and network bandwidth and increases the performance of trading strategies and quant research. What you describe is the core reason why relational databases are a poor choice for the quantitative world - no inbuilt analytics (unless you consider SUM,AVE analytics), thus making all true analysis external to the data. This is where and why the latency/performance penalty (not to mention custom coding effort) is incurred. Real 'tick' engines such as OneMarketData's OneTick product provide true analytical processing inside the data engine - for both historical and live markets. So quant strategies and backtesting can take advantage of nanosecond performance.
gvalente
23 March 2011
While I agree that some databases are a poor choice for "Big Data", some a poor choice for "analytics", and even more of them a poor choice for "Big Data Analytics", I think you might have thrown the baby out with the bath water here. Many MPP relational databases include inbuilt analytics these days - Netezza, Aster, and my company XtremeData all do it in a large-scale MPP fashion from 1TB to 10PB in size. Analytics at the data with all cores (1000's of them working in parallel on their part of the data without extracting it from the DB itself). So what do you mean exactly? For example, we all support the ability to use SQL to call a built-in analytical function library in the database (like KXEN, SAS, R, and/or Fuzzy). You, the user, do not have to know anything new, but simply ask the question you want to ask. These are standard languages and tools - R is now the most popular statistical language and SQL has been used for 20+ years as the de facto standard. SAS/KXEN/Fuzzy bring analytical functions to the table as a library so you just call it with a user-defined function - it is very simple and not a "custom coding effort" at all. Now, we each have our own features and benefits, but saying RDBMS is a poor choice is simply not true anymore and hasn't been for 3-5 years. If anything, I would say that OneTick is a custom coding effort, doesn't linearly scale to "big data", and has a fixed, OneTick-defined "schema". I'm not a OneTick expert, but from what I've seen and competed against, OneTick is the KING at anything that fits in memory / CEP and falls apart at anything larger than one server, aka your OneQuantData product at large scale (full depth of book types of equities / options data). Maybe that has changed and I'll let others jump in, but I'm pretty sure you've overlooked the past 5 years of innovation in the "Big Data Analytics" space with your comments above. gv: twitter/xtremedata
louislovas
24 March 2011
I’ve always loved that phrase, “throw the baby out with the bath water”; it conjures up all sorts of mental images. OneQuantData is a reference data product, something completely different from OneTick. OneTick scales, and quite massively; we have numerous customers running options market making across multiple servers. The product easily leverages multiple cores to handle the OPRA firehose. Data can be partitioned or duplicated across servers, so full shared-nothing or shared-everything is possible, and user queries can be parallel processed across cores - either on a single machine or on multiple machines. There is complete flexibility with server deployments. Schema creation is, and always has been, completely flexible and user definable. We do ship a number of pre-defined schemas for trades, quote and orderbook types because they are the most often used in our customers’ business – why not. We make the most efficient use of memory, and like any system, the more the better. User queries have transparent access to data; it can reside in memory, in archival databases, or streaming live. A query’s time frame can span from years in the past to the present and into the future. The OneTick architecture will manage the blending of data from on-disk, in-memory and live connections. In finance, if you store time-series tick data, users will demand that you also handle corporate actions, futures contract rolls, corrections and cancellations without undue data management/manipulation headaches. This is all a natural part of OneTick since its sole purpose is financial data modeling. In addition to our own analytical library of over 100 functions highly purposed for the financial markets, R and MATLAB are directly integrated within our server. All these functions can be applied to any schema type in user queries – even a complex structure like an order book.
Creative innovation by Vertica, Netezza, XtremeData and others has brought relational engines out of the stone age; it’s about time. Yet such technology is more general-purpose by design. Those new vendors make great Oracle-busters since they can drop-replace them and likely be orders of magnitude faster… for some shipping company’s order-entry application. But effectively managing financial data is a horse of a different color; general-purpose database technology doesn’t play.
dwatkins
28 March 2011
Thank you for your comments and feedback Luis!! I have learned a few things, so thank you! However, I do have to say that even though the large data-warehouse companies offer general-purpose computing, where they differ even more from the traditional db's is in sheer speed and capacity. Many have allowed Fuzzy Logix to embed its in-database analytics to co-exist in code with some of the faster dw's out there, and have brought processing times from days down to as low as seconds. The real-time world can still only swallow a bite size at a time, a terabyte or two, before discarding valuable bits of information which could be used later. The large 30 - 50 terabyte dw's, which no longer have to discard old data but instead create a historical db and add to it regularly, are a next-gen technology most will adopt soon enough. Plus, the db's and the dw's out here have so many more compute cores that calculations can run much faster and, with more data to sniff, more accurately.