    Convert Scanned PDF to Image in Python



    This article explains how to set up the dependencies and environment needed to use OCR to extract text from scanned PDFs or images.

    Extracting text from a normal PDF is easy and convenient: we can use pdfminer (Python 2) or pdfminer.six (Python 3) and follow their instructions to get the text content. A scanned PDF, however, is essentially a set of images, so extracting its text needs a somewhat more complicated setup. In addition, the setup is easy on Linux but harder on Windows.
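
As a rough illustration of the text-based case, the sketch below uses the extract_text helper from pdfminer.six (Python 3) and then applies a simple heuristic to guess whether a PDF is scanned. The function names and the min_chars threshold are assumptions made for this example; they are not part of the original setup.

```python
def extract_pdf_text(path):
    """Return the embedded text of a PDF using pdfminer.six."""
    # Imported lazily so looks_scanned works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def looks_scanned(text, min_chars=20):
    """Heuristic (an assumption): almost no extractable text suggests
    that the pages are images and need OCR."""
    return len(text.strip()) < min_chars


# Usage, assuming pdfminer.six is installed and demo.pdf exists:
# text = extract_pdf_text("demo.pdf")
# print("needs OCR" if looks_scanned(text) else text)
```

If looks_scanned returns True for a file, the OCR route described in the rest of this article is probably the one to take.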

    Basic package and software needed

    We want to use pyocr to extract what we need. To use it correctly, we need the following important dependencies:

    • Python Imaging Library (PIL)
    • Wand
    • tesseract-ocr
    • ghostscript
    • ImageMagick

      Note that PIL can be installed with conda install pil. We also need to set up the environment and path. First of all, do not change the default names of the installation folders; you can change the directory they are installed in. But if you change the directory, you need to change some of the path setup in tesseract.py in the pyocr package.

      For the system path and environment, you need to add the directories of ghostscript, ImageMagick and tesseract-ocr to the system path:

    • create a new variable MAGICK_HOME and set it to the ImageMagick and ghostscript directories: E:\system\ImageMagick-6.9.7-Q8; E:\system\gs9.20\bin
    • add them to the path: E:\system\ImageMagick-6.9.7-Q8; E:\system\gs9.20\bin
    • create a new variable TESSDATA_PREFIX and set it to the tesseract directory: E:\system\Tesseract-OCR
    • change tesseract.py as follows:

          # CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
          TESSERACT_CMD = os.environ["TESSDATA_PREFIX"] + os.sep + 'tesseract.exe' if os.name == 'nt' else 'tesseract'
          TESSDATA_EXTENSION = ".traineddata"

          logger = logging.getLogger(__name__)
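
The same lookup can be written so that it does not raise a KeyError when TESSDATA_PREFIX is unset. This is only a sketch: resolve_tesseract_cmd is a hypothetical helper, not part of pyocr, and it assumes tesseract.exe sits directly in the TESSDATA_PREFIX folder, as in the setup above.

```python
import ntpath
import os


def resolve_tesseract_cmd(environ=None, os_name=None):
    """Build the tesseract command like the edited tesseract.py does,
    but fall back to a bare command name when the variable is missing."""
    environ = os.environ if environ is None else environ
    os_name = os.name if os_name is None else os_name
    if os_name != "nt":
        return "tesseract"
    prefix = environ.get("TESSDATA_PREFIX")
    if not prefix:
        # Assumption: rely on PATH when the variable is not set.
        return "tesseract.exe"
    # ntpath joins with Windows rules regardless of the host OS.
    return ntpath.join(prefix, "tesseract.exe")


print(resolve_tesseract_cmd({"TESSDATA_PREFIX": r"E:\system\Tesseract-OCR"}, "nt"))
```

Passing the environment and OS name as arguments makes the function easy to try out on any machine; called without arguments it reads os.environ and os.name.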

    Environment

    Once everything is set up, open cmd and run the following command to see whether the conversion works correctly:

          convert filename.pdf filename.jpg
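
A complementary check can be scripted. The sketch below looks up the command-line tools on PATH with shutil.which and reports unset environment variables. The executable names (convert, tesseract, and gswin64c for Ghostscript on 64-bit Windows) are assumptions and may differ on your installation; missing_tools is a helper invented for this example.

```python
import os
import shutil


def missing_tools(executables, env_vars, environ=None):
    """Return (executables not found on PATH, environment variables not set)."""
    environ = os.environ if environ is None else environ
    not_on_path = [name for name in executables if shutil.which(name) is None]
    unset = [name for name in env_vars if not environ.get(name)]
    return not_on_path, unset


# Assumed names; adjust them to your installation.
gs_name = "gswin64c" if os.name == "nt" else "gs"
tools, variables = missing_tools(
    ["convert", "tesseract", gs_name],
    ["MAGICK_HOME", "TESSDATA_PREFIX"],
)
print("missing executables:", tools)
print("unset variables:", variables)
```

On Windows, note that a system utility named convert.exe also exists, so finding convert on PATH does not by itself prove that ImageMagick is the one being picked up; running the conversion command is still the real test.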

    Python OCR script

    When all of that is done, we can write the Python script:

    • importing the required libraries:
          from wand.image import Image
          from PIL import Image as PI
          import pyocr
          import pyocr.builders
          import io
    • get the handle of the OCR library (tesseract)
          tool = pyocr.get_available_tools()[0]
          # check which languages are in the list; on my computer [0] is eng
          lang = tool.get_available_languages()[0]

    If tesseract is not set up correctly, you will get an empty (null) result at this step; in that case, carefully check the environment path setup.
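
One way to make this step fail with a clearer message, and to avoid depending on the order of the language list, is sketched below. pick_language and get_ocr_tool are hypothetical helpers written for this example; only get_available_tools and get_available_languages come from pyocr.

```python
def pick_language(available, preferred="eng"):
    """Return the preferred language if installed, else the first one."""
    if not available:
        raise RuntimeError("tesseract reports no languages; check TESSDATA_PREFIX")
    return preferred if preferred in available else available[0]


def get_ocr_tool():
    """Return the first OCR tool pyocr can find, or raise a readable error."""
    import pyocr  # imported lazily so pick_language works without pyocr
    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("no OCR tool found; check the tesseract path setup")
    return tools[0]


# Usage, assuming pyocr and tesseract are installed:
# tool = get_ocr_tool()
# lang = pick_language(tool.get_available_languages())
print(pick_language(["osd", "eng"]))
```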

    • set up two lists to store the images and final_text
          req_image = []
          final_text = []
    • open the PDF file using wand and convert it to jpeg
          image_pdf = Image(filename="path/filename.pdf", resolution=300)
          image_jpeg = image_pdf.convert('jpeg')

    If ghostscript is not set up correctly, this part raises an error; the one I usually encounter is 798 : the system could not find the file. Here you need to check the environment path and also make sure you have not renamed the installation folder. I renamed the folder at the beginning, and it took me a long time to fix this problem.

    wand has converted each page of the PDF into a separate image. We can loop over the pages and append each one as a blob to the req_image list.

          for img in image_jpeg.sequence:
              img_page = Image(image=img)
              req_image.append(img_page.make_blob('jpeg'))
    • run OCR to get the text
          for img in req_image:
              txt = tool.image_to_string(
                  PI.open(io.BytesIO(img)),
                  lang=lang,
                  builder=pyocr.builders.TextBuilder()
              )
              final_text.append(txt)

    The conversion takes a few minutes to finish.
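
Since each entry in req_image is already a JPEG blob, the page images and the recognised text can also be written to disk. The sketch below is written for Python 3 and assumes req_image holds bytes objects and final_text holds strings, as in the steps above; save_results, the output folder and the file naming scheme are made up for this example.

```python
import os


def save_results(blobs, texts, out_dir, stem="page"):
    """Write each JPEG blob to <stem>_NNN.jpg and all text to <stem>.txt."""
    os.makedirs(out_dir, exist_ok=True)
    image_paths = []
    for number, blob in enumerate(blobs, start=1):
        image_path = os.path.join(out_dir, "%s_%03d.jpg" % (stem, number))
        with open(image_path, "wb") as handle:
            handle.write(blob)
        image_paths.append(image_path)
    text_path = os.path.join(out_dir, stem + ".txt")
    with open(text_path, "w", encoding="utf-8") as handle:
        # A form feed between pages keeps the page boundaries visible.
        handle.write("\f".join(texts))
    return image_paths, text_path


# Usage with the lists built above:
# save_results(req_image, final_text, "output")
```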

    Full code

    The full code is:

        # -*- coding: utf-8 -*-
        # python 2.7
        # required packages: pyocr, PIL, wand

        from wand.image import Image
        from PIL import Image as PI
        import pyocr
        import pyocr.builders
        import io

        # raw string so the backslash in the Windows path is kept as-is
        path = r"your path directory\demo.pdf"

        tool = pyocr.get_available_tools()[0]
        lang = tool.get_available_languages()[0]  # 0 is eng

        req_image = []
        final_text = []

        image_pdf = Image(filename=path, resolution=300)
        image_jpeg = image_pdf.convert('jpeg')

        for img in image_jpeg.sequence:
            img_page = Image(image=img)
            req_image.append(img_page.make_blob('jpeg'))

        for img in req_image:
            txt = tool.image_to_string(
                PI.open(io.BytesIO(img)),
                lang=lang,
                builder=pyocr.builders.TextBuilder()
            )
            final_text.append(txt)

    References

    • OCR on PDF files using Python
    • extracting normal PDF using pdfminer


