As the world becomes ever more digital, many businesses need to process documents automatically. In this webinar, we’ll walk through an example end-to-end project for extracting, classifying and summarising PDF documents, and show how you can combine cutting-edge open-source technologies with your own in-house expertise and requirements to build your own PDF Processor with Dataiku DSS.
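To give a flavour of the extraction step, here is a minimal sketch that assumes the pdf2image and pytesseract packages are installed along with their system dependencies (Poppler and Tesseract). The function names `pdf_to_text` and `clean_ocr_text`, and the 300 dpi setting, are illustrative assumptions rather than the code used in the webinar.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output (hypothetical helper, not from the webinar)."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"-\n(?=\w)", "", raw)
    # Collapse any run of whitespace (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """Render each PDF page to an image, OCR it, and return the cleaned text."""
    # Imported lazily because both packages rely on external binaries
    # (Poppler for pdf2image, Tesseract for pytesseract).
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images, one per page
    return "\n".join(clean_ocr_text(pytesseract.image_to_string(page)) for page in pages)
```

The resulting text could then be passed to whichever classification and summarisation steps suit your project, for example inside a Python recipe in Dataiku DSS.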
PDF2Image (https://pypi.org/project/pdf2image/)
Tesseract OCR (https://tesseract-ocr.github.io/tessdoc/Home.html)
Pytesseract (https://pypi.org/project/pytesseract/#description)
The Plugin Store (https://www.dataiku.com/product/plugins/)
The Text Summarisation Plugin (https://www.dataiku.com/product/plugins/text-summarization/)
scikit-learn 20 Newsgroups Dataset (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#)
"Surprising Findings in Document Classification" (https://towardsdatascience.com/surprising-findings-in-document-classification-7a79e30f1666)
Webinar (tomorrow): How to Reduce Data Labelling Costs (+ Increase Data Quality) With Active Learning (https://www.brighttalk.com/webcast/17108/394533?utm_campaign=channel-feed&utm_source=brighttalk-portal&utm_medium=web)