ICSC interdisciplinary research	Fourth International ICSC Symposium on ENGINEERING OF INTELLIGENT SYSTEMS (EIS 2004), in collaboration with the University of Madeira, Island of Madeira, Portugal, February 29 – March 2, 2004


Session:	Knowledge Representation, Decision Support and Expert Systems; Tuesday, March 2, 2004, 12.10 - 12.30
Session Chair:	A. Dobnikar
Vice Chair:	M. Savoji

Paper Title:	EASE – A Software Agent That Extracts Financial Data from the SEC’s EDGAR Database

Author(s):	Prof. D. Seese, University Karlsruhe, Germany; O. Cetinkaya, University Karlsruhe, Germany; R. Spoeth, University Karlsruhe, Germany; T. Stuempert, University Karlsruhe, Germany

Abstract:	In this paper we discuss text mining approaches for financial data from the Electronic Data Gathering, Analysis and Retrieval (EDGAR) database of the Securities and Exchange Commission (SEC), which contains filings, including financial statements, of about 68,000 companies. The structure of these filings varies between companies and also changes over time for individual companies. Moreover, their technical specification is comparatively weak. Together, these factors make automated data extraction a great challenge for software agents. The focus of this paper is the recognition of balance sheets, that is, how to find the relevant sections in a large document. A filing consists of either HTML or plain text, and we follow a different approach for each type. For HTML-encoded content, the agent builds a DOM (Document Object Model) instance on top of the non-standard filing, which allows very convenient data access. This DOM-based approach revealed additional potential for navigating these filings in order to detect financial information faster and more reliably, even when filings do not strictly adhere to syntactic conventions. For plain text, a modified vector space model has been developed. We succeeded in extracting key financial information at a reasonably high level for conventional text files.
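
To illustrate the plain-text case, the following is a minimal sketch of how a vector space model could locate a balance-sheet section in a filing. It uses plain cosine similarity over term counts, not the modified model developed in the paper, whose details the abstract does not give. The query vocabulary, the window size, and the names `term_vector`, `cosine` and `best_window` are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical query vocabulary for balance sheets (not taken from the paper).
BALANCE_SHEET_TERMS = (
    "total assets current liabilities stockholders equity "
    "cash receivable inventories payable retained earnings"
)


def term_vector(text):
    """Map text to a sparse bag-of-words vector of lowercase term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_window(lines, size=5):
    """Return (start_index, score) of the `size`-line window most similar to the query."""
    query = term_vector(BALANCE_SHEET_TERMS)
    best = (0, 0.0)
    # Slide a fixed-size window over the filing and keep the best-scoring start.
    for start in range(max(1, len(lines) - size + 1)):
        window = " ".join(lines[start:start + size])
        score = cosine(term_vector(window), query)
        if score > best[1]:
            best = (start, score)
    return best
```

Calling `best_window(filing_text.splitlines())` on a plain-text filing returns the start line of the candidate section together with its similarity score; a real agent would presumably need a threshold on that score and some handling of multi-page statements.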

