SORT IT: Build a PDF Processor

Presented by

Adam Jelley, Data Scientist

About this talk

As the world becomes ever more digital, many businesses need to process documents automatically. In this webinar, we'll walk through an example end-to-end project for extracting, classifying and summarising PDF documents, and show how you can combine cutting-edge open-source technologies with your own in-house expertise and requirements to build your own PDF Processor with Dataiku DSS.

Resources:
PDF2Image (https://pypi.org/project/pdf2image/)
Tesseract OCR (https://tesseract-ocr.github.io/tessdoc/Home.html)
Pytesseract (https://pypi.org/project/pytesseract/#description)
The Plugin Store (https://www.dataiku.com/product/plugins/)
The Text Summarisation Plugin (https://www.dataiku.com/product/plugins/text-summarization/)
Scikit-learn 20 Newsgroups Dataset (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#)
"Surprising Findings in Document Classification" (https://towardsdatascience.com/surprising-findings-in-document-classification-7a79e30f1666)
Webinar (tomorrow): How to Reduce Data Labelling Costs (+ Increase Data Quality) With Active Learning (https://www.brighttalk.com/webcast/17108/394533?utm_campaign=channel-feed&utm_source=brighttalk-portal&utm_medium=web)
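As a rough illustration of the extraction step, here is a minimal sketch that renders PDF pages with pdf2image and reads them with Pytesseract. It is not the code shown in the webinar: the function names (`ocr_pdf_pages`, `clean_ocr_text`, `extract_text`), the 300 DPI default and the clean-up rules are all assumptions made for this example, and the OCR step requires the poppler utilities and the Tesseract binary to be installed.

```python
import re
from typing import Callable, List


def ocr_pdf_pages(pdf_path: str, dpi: int = 300) -> List[str]:
    """Render each PDF page to an image and OCR it; returns one string per page."""
    # Imported lazily so the rest of this sketch runs without the OCR stack installed.
    from pdf2image import convert_from_path  # relies on the poppler utilities
    import pytesseract  # relies on the Tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [pytesseract.image_to_string(page) for page in pages]


def clean_ocr_text(text: str) -> str:
    """Hypothetical clean-up: rejoin words split across lines, collapse whitespace."""
    # Simplification: this also merges genuine compounds broken at a hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_text(pdf_path: str,
                 ocr: Callable[[str], List[str]] = ocr_pdf_pages) -> str:
    """Run OCR on a PDF and return cleaned text, pages separated by blank lines."""
    return "\n\n".join(clean_ocr_text(page) for page in ocr(pdf_path))
```

Passing the OCR function as a parameter keeps the clean-up logic testable without a real PDF, and the cleaned text can then be fed to whatever classification and summarisation steps follow.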

Dataiku is the world’s leading platform for Everyday AI, systemizing the use of data for exceptional business results. Organizations that use Dataiku elevate their people (whether technical and working in code or on the business side and low- or no-code) to extraordinary, arming them with the ability to make better day-to-day decisions with data. More than 450 companies worldwide use Dataiku to systemize their use of data and AI, driving diverse use cases from fraud detection to customer churn prevention, predictive maintenance to supply chain optimization, and everything in between.