
pdftabextract - A set of tools for data mining (OCR-processed) PDFs


Introduction

This repository contains a set of tools written in Python 3 for extracting tabular data from (OCR-processed) PDF files. Before these files can be processed, they need to be converted to XML files in pdf2xml format. The conversion is simple; see the repository linked below for instructions.

https://github.com/WZBSocialScienceCenter/pdftabextract
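
To give a rough idea of what working with a converted file can look like, here is a minimal sketch that uses only the Python standard library, not pdftabextract's own API. It assumes the pdf2xml layout in which each page element holds text elements with top and left pixel attributes. The sample XML, the function names (read_text_boxes, group_into_rows) and the tolerance value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed pdf2xml layout (not taken from a real PDF).
SAMPLE_XML = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1263" width="892">
    <text top="100" left="50" width="40" height="12" font="0">Name</text>
    <text top="100" left="200" width="30" height="12" font="0">Year</text>
    <text top="120" left="50" width="45" height="12" font="0">Alice</text>
    <text top="121" left="200" width="30" height="12" font="0">1990</text>
  </page>
</pdf2xml>"""


def read_text_boxes(xml_string):
    """Collect every text box with its page number and position."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            boxes.append({
                "page": page_no,
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also picks up text nested in <b> or <i> tags
                "text": "".join(node.itertext()).strip(),
            })
    return boxes


def group_into_rows(boxes, tolerance=3):
    """Group boxes whose top coordinates differ by at most `tolerance` pixels.

    This is a naive heuristic for illustration only; OCR output is often
    skewed and needs more careful handling.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: (b["page"], b["top"], b["left"])):
        first = rows[-1][0] if rows else None
        same_row = (
            first is not None
            and first["page"] == box["page"]
            and abs(first["top"] - box["top"]) <= tolerance
        )
        if same_row:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order the cells of each row from left to right and keep only the text.
    return [[b["text"] for b in sorted(row, key=lambda b: b["left"])] for row in rows]


if __name__ == "__main__":
    for row in group_into_rows(read_text_boxes(SAMPLE_XML)):
        print(row)
```

Grouping by the top coordinate is only the simplest possible starting point; real scanned pages usually need skew correction and column detection, which is the kind of work the tools in the repository are meant for.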

