A Revolution in the Structure of Your Financial News


In "The Structure of Scientific Revolutions", published in 1962, Thomas Kuhn challenged the view that the progress of science is nothing but the accumulation of new facts and theories. He demonstrated that for scientific progress to be possible, it is essential that old paradigms be replaced by new ones. The expression that would become a cliché, "the exponential progress in science and technology", did not exist yet. Three years later, in 1965, Gordon Moore stated his observation on the increase in the number of transistors in an integrated circuit. What is now referred to as Moore's law was the first realisation that science and technology can indeed progress exponentially, and similar laws have since been stated and verified in the converging fields of Information Technology, Nanotechnology, Genomics, and Cognitive science. Concurrently, the amount of available data has been growing exponentially, with Big Data there to harness the growth. One can wonder what Kuhn would have said of modern science and technology. Is there still time for genuine paradigm shifts? Can exponential growth afford not to be almost exclusively cumulative? In the epistemological foundations of Big Data, correlations supplant laws, predictions replace explanations, power takes precedence over meaning. Still, Big Data does not amount to a paradigm shift; it is not a new paradigm replacing an old one. It is just one more set of concepts and techniques; accumulation again.

Most of those who reflect on the changes that exponential scientific and technological progress, and Big Data, will bring to society see a world where humans are becoming more and more hopelessly incompetent. It seems inevitable that for most professions that require a long education, long practice and high skills, professions such as surgery, successful applicants will be artificial systems, not human beings. When the most reliable diagnosis and the most effective treatment are already the output of an algorithmic procedure that processes more data than a traditional doctor could analyse working full time for over three years, those of tomorrow's doctors who want to keep their job must know at least as much about data mining and Big Data as about physiology and epidemiology. Are financial analysts destined for a similar fate? Without giving the question much thought, the answer seems to be a definite yes. Financial news and financial data are massive. There is no Newtonian mechanics to model the evolution of stock prices, and the aim is not really to understand, to give meaning; the objective is to maximise profits, to have good returns. But economics is a social science, one whose object of study has at its heart human attitudes, human behaviour, human perceptions, human decisions, human exuberance and human fits of panic. This could mean that economics is all the more a domain where one should seek correlations, not classical models, a domain where one should analyse huge quantities of data in a high-dimensional space and do nothing but apply sophisticated statistics and data mining techniques. But it could also mean that humans will not be deemed incompetent as quickly as in fields where their emotions, their desires, their ambitions and their fears play no role, or should be controlled to the largest possible extent.
As economic activity is affected by virtually all aspects of human nature, as what drives and determines so much of it is deeply human in essence, it could be that for a long time, human beings will remain unbeatable at "feeling" it, despite all its complexity, despite the massive amounts of data it generates. Big Data, data mining and statistics are indispensable, absolutely indispensable, to today's financial analysts. But now, and probably for a long time still, they are tools at their disposal, only tools, there to help them, not to defeat them.

To market analysts and hedge fund managers, financial news is the key resource, and a source of input for Big Data analysis. Between the text that makes up a news item and a probabilistic distribution of the value of a share of a company C, a basic correlation exists. That distribution is nontrivial only over a given interval, one that spans from when the news starts to affect the share's value, some time after it has been publicised, to when the news has been totally consumed and has no more effect on the share's value. The less uncertainty in the distribution, the stronger the correlation. If the news is about a new product release from company C, the correlation might be strong. If the news is about a new product release from a different company, the correlation might still be strong, for instance because that company is one of C's competitors. If the news is about a natural disaster affecting the production units of a factory, then again the correlation might be significant, because that factory is one of C's suppliers. So in order to determine all potential basic correlations, one needs to identify the entities of a broad domain and build a graph of relationships: between a company and its competitors, between a company and its suppliers, between many more entities. One needs to be able to extract the semantics of a text, to understand whether the news is about a merger or acquisition, or an analyst recommendation. One needs to be able to identify the type of news and the sentiment it conveys. This is only to identify basic correlations, as sets of news items, considered together on a particular timeline, are then candidates for higher-order correlations.
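The use of such a relationship graph can be sketched in a few lines of code. The sketch below is purely illustrative: the entities, the relation types and the graph itself are invented for the example, and in a real system they would be extracted from the news stream and a knowledge base. Given the entities mentioned in a news item, it walks the graph to find portfolio companies that may be indirectly affected, for instance via competitor or supplier links.

```python
# Illustrative sketch with an invented relationship graph:
# entity -> list of (related entity, relation type).
from collections import deque

RELATIONS = {
    "AcmeCorp": [("BetaCorp", "competitor"), ("GammaFab", "supplier")],
    "GammaFab": [("AcmeCorp", "supplies"), ("DeltaInc", "supplies")],
    "BetaCorp": [("AcmeCorp", "competitor")],
}

def related_companies(mentioned, portfolio, max_hops=2):
    """Return the portfolio companies reachable from the entities mentioned
    in a news item, within max_hops relationship edges, together with the
    chain of entities that links them."""
    hits = {}
    for start in mentioned:
        queue = deque([(start, [start])])
        seen = {start}
        while queue:
            entity, path = queue.popleft()
            if entity in portfolio and entity != start:
                hits.setdefault(entity, path)
            if len(path) - 1 < max_hops:
                for neighbour, _relation in RELATIONS.get(entity, []):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        queue.append((neighbour, path + [neighbour]))
    return hits

# A news item about GammaFab, e.g. a natural disaster at its factory,
# surfaces AcmeCorp and DeltaInc, both supplied by GammaFab.
print(related_companies({"GammaFab"}, {"AcmeCorp", "DeltaInc"}))
# → {'AcmeCorp': ['GammaFab', 'AcmeCorp'], 'DeltaInc': ['GammaFab', 'DeltaInc']}
```

The breadth-first traversal is bounded by a hop count because, as the paragraph above suggests, correlations weaken as the relationship chain between a mentioned entity and a portfolio company grows longer.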

Our aim is to analyse financial news, in real time, and determine which items are strongly correlated to the distribution of values of the shares of the companies in an investor's portfolio; to list those with the strongest correlations at the top; and to classify them along categories found to be associated with correlations of some kind. Our aim is to understand the domain, to capture the relationships between entities, to infer and discover the most indirect correlations, those that cannot be discovered from the entities occurring in the texts of the news but that are related to them. It is a big challenge, an exciting challenge, and we have tackled it with already remarkable results that we invite you to explore.
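The listing and classification steps described above can be illustrated with a minimal sketch. The news items, categories and correlation scores below are invented for the example; the scores stand in for the output of an upstream correlation model.

```python
# Minimal sketch with invented data: rank news items for a portfolio by an
# estimated correlation strength, strongest first, and group the ranked
# headlines by the category assigned by an upstream classifier.
from collections import defaultdict

def rank_news(items):
    """items: dicts with 'headline', 'category' and 'correlation' (a score
    in [0, 1]). Returns the items sorted by descending correlation, plus a
    per-category index of headlines in that order."""
    ranked = sorted(items, key=lambda n: n["correlation"], reverse=True)
    by_category = defaultdict(list)
    for item in ranked:
        by_category[item["category"]].append(item["headline"])
    return ranked, dict(by_category)

news = [
    {"headline": "Supplier factory hit by flood", "category": "supply chain",
     "correlation": 0.8},
    {"headline": "Analyst upgrades competitor", "category": "recommendation",
     "correlation": 0.5},
    {"headline": "New product release", "category": "product",
     "correlation": 0.9},
]

ranked, by_category = rank_news(news)
print([n["headline"] for n in ranked])
# → ['New product release', 'Supplier factory hit by flood',
#    'Analyst upgrades competitor']
```

Sorting and grouping are the easy part; the substance of the system lies in producing the correlation scores and categories that this sketch takes as given.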