|
Introduction
A typical problem to which regex is particularly suited is that of importing data into a spreadsheet. For compatibility, spreadsheets published for distribution are often presented in .pdf, or Adobe Acrobat Portable Document Format. From the perspective of distribution this is an ideal format.
When publishing, a key consideration is the person who might view the published data. What sort of computer might they have? What applications might that person have on their computer? In most cases, it's not going to be possible to determine who the user is, let alone what sort of computer or software they have. In this sense Acrobat is the ideal format, and this is the task for which the format was primarily designed.
With all this in mind, distributing files in .pdf format is not a bad idea, but when it comes to tabular data, the case is less clear. Acrobat has been designed for compatibility, and as such, at its core, the .pdf format aims to store a document in at least two different forms: graphical and text. Typically most applications, spreadsheets included, are graphical. When we open the .pdf document, it's the graphical page that we see.
For casual use, then, the graphical image of a spreadsheet is probably enough, especially if the analysis of the spreadsheet data has already been done and graphs are given in the .pdf. Often, though, the spreadsheet presented as a .pdf is just tabular data. The basic intent of the publisher was, perhaps, that the reader would do their own analysis. The trouble in this case is that because the data we see in the .pdf file is graphical, there is no obvious way to get it out of the .pdf and into a spreadsheet for analysis.
To address this problem, the .pdf contains a text representation of its image data. For our spreadsheet case this text data does not come without problems. The issue stems from the fact that the .pdf format has been designed to deal not just with spreadsheets, but with any graphical data format. Typically this might include block diagrams, where the blocks contain text. The .pdf format associates the text it contains with the image elements in the graphical image.
This association is great for text search within the .pdf document, for example. It means that you can type search text into Adobe Acrobat and be taken directly to the graphical image represented by that text. For our spreadsheet problem, however, things are more difficult. When the original spreadsheet (which we don't have) was used to generate the .pdf document, the information we now need to recreate the spreadsheet was stripped out before the data was absorbed into the .pdf document. This stripping satisfies the need of the .pdf file to associate simple text strings with graphical image objects, but it damages the relationship between the text strings, which was important in the original spreadsheet.
It's not all bad, however. Because the text is in the .pdf file, we can get it and, using regular expressions, work it back into a format suitable for use in virtually any spreadsheet. This format is widely known as .csv, or Comma Separated Values. CSV is a very simple format for transmitting tabular data. In simple terms, a row is represented on a line of text, and cells within the row are separated by commas, hence the name.
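To make the idea concrete, here is a minimal sketch in Python of the kind of conversion described above. It is not the method used in the example that follows. The sample text, the function name to_csv, and the assumption that cells are separated by runs of two or more whitespace characters are all hypothetical, chosen only for illustration; text taken from a real .pdf is usually messier.

```python
import re

# Hypothetical sample of text as it might come out of a .pdf:
# each row is on its own line, cells are separated by runs of spaces.
extracted = """Item      Qty   Price
Widget    4     3.50
Gadget    10    12.00"""


def to_csv(text):
    """Turn whitespace-aligned rows into comma-separated rows.

    Assumes (for illustration only) that cells are separated by two
    or more whitespace characters and contain no commas themselves.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Replace each run of two or more whitespace characters with one comma.
        rows.append(re.sub(r"\s{2,}", ",", line))
    return "\n".join(rows)


csv_text = to_csv(extracted)
print(csv_text)
```

Under these assumptions the first row becomes Item,Qty,Price. A cell that itself contains a comma would need to be quoted, which this sketch does not attempt.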
The problem then is to get the text produced from the .pdf file, into .csv. We've produced a .zip file containing resources we'll use and produce in this example, RegexExample.zip. |