In previous posts, we looked at a variety of Linux command line techniques for analyzing text and finding patterns in it, including word frequencies, permuted term indexes, regular expressions, simple search engines and named entity recognition. In this post we focus on a preliminary issue: converting images of texts into text files that we can work with.

Starting with digital photographs or scans of documents, we can apply optical character recognition (OCR) to create machine-readable texts. These will certainly have some errors, but the quality tends to be surprisingly good for clean scans of recently typed or printed pages. Older fonts and texts, or warped, indistinct or blurry page images, often result in lower quality OCR.

Using a window manager

As with earlier posts, we are going to use command line tools to process our files. When working with page images, however, it is very useful to be able to see pictures. The standard Linux console does not have this facility, so we need to use a window manager or a GUI desktop environment. The former is a lightweight application that allows you to view and manipulate multiple windows at the same time; the latter is a full-blown interface to your operating system that includes graphical versions of your applications. Sometimes you use a mouse with a window manager, but most of your interactions continue to be at the command line. With a GUI desktop, the expectation is that you will spend most of your time using a mouse for interaction (this is very familiar to users of Windows or OS X).

In Linux, you can choose from a variety of window managers and desktop environments. If you are working with a Linux distribution that does not already have a window manager or desktop environment installed, you will need one. Here I will be using a window manager called Openbox, but most of the commands should work fine with other Linux configurations. If you look for their man pages and don’t get one, you can install X Windows and Openbox with the following.

sudo aptitude install xorg xserver-xorg xterm
sudo aptitude install openbox obconf obmenu

Next we will want some command-line image processing software to manipulate page images. Check to see if ImageMagick is installed by looking for its man page. If you don’t get a man page, install it with your package manager, as with the packages above.
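One way to see at a glance which of the tools mentioned in this post are already present is a small shell loop. This is only a sketch under a few assumptions: the helper name `have` is made up for illustration, and the command names checked (`startx` for X Windows, `openbox`, and `convert` or `magick` for ImageMagick, depending on the version) may differ on your distribution.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Command names are assumptions; adjust them for your system.
# Older ImageMagick releases provide "convert", newer ones "magick".
for tool in startx openbox convert magick; do
    if have "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```

Anything reported as missing is a candidate for installing with your package manager.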