Result of Service
All products specified herein must be prepared considering the activities described in the previous item. Deliveries must be validated by the consultancy's supervision and, if necessary, revised versions must be resubmitted.
PRODUCT 1: Technical report containing a survey of the formats and structures of state reports. The document shall consolidate a mapping of the formats (PDF, DOCX, internal systems, etc.), standards, and information fields used in each federal unit (UF) and in the Federal Police. It shall also contain a comparative table between the UFs and between the UFs and the Federal Police, highlighting similarities and challenges.
PRODUCT 2: Technical document containing a map of variables of interest and initial work taxonomy.
PRODUCT 3: Technical document containing a methodological proposal for data extraction and structuring. The document should detail the use of OCR, NLP, and standardisation techniques. If possible, it should compare optical character recognition (OCR) results obtained with traditional approaches and with LLMs. It should also contain an illustrated workflow (data pipeline) and the quality and validation criteria for extractions.
PRODUCT 4: Technical document containing code (in Python or R, for example) for extracting data from a sample set of reports. The code delivered must have already been tested in different file formats. The document must also contain step-by-step instructions for applying the code.
PRODUCT 5: Technical document containing the results of the extraction from real samples of reports from some states and the Federal Police. It should contain an assessment of accuracy, limitations, and necessary adjustments.
PRODUCT 6: Technical document containing code revised after sample pre-testing and scripts and technical documentation for the OCR and NLP modules adapted to forensic reports. The document should include text pre-processing (cleaning, tokenisation, and data normalisation).
PRODUCT 7: Unified database with information extracted from reports with a relational or document-oriented structure, ready for analysis. It should include data from all states and the Federal District.
PRODUCT 8: Technical document containing a dictionary of standardised variables and metadata, with a clear definition of each variable, format, unit of measurement and rules for completion, including details in formats such as JSON Schema (for use in OpenAI and Ollama APIs, for example).
PRODUCT 9: Document containing a technical manual and final report on the methodology for replication, including instructions for using the tools, maintenance, and updating. It should also contain an analysis of challenges and recommendations for future expansion.

Work Location
Home based

Expected duration
02.2026 - 01.2027

Duties and Responsibilities
The consultant will participate in and perform the technical activities described below:
• Survey the formats, standards, and information fields existing in expert reports and state and Federal Police data entry systems.
• Map the main variables of interest.
• Develop a methodological proposal for data extraction and structuring, including the use of Artificial Intelligence and Natural Language Processing (NLP) tools.
• Perform automated data extraction from a sample set of reports.
• Implement Optical Character Recognition (OCR) and NLP techniques.
• Build a database with the extracted data.
• Create a dictionary of variables and metadata, including details in formats such as JSON Schema (for use in OpenAI and Ollama APIs, for example).
• Write manuals and documents detailing the methodology for replication and memory.
• Participate in periodic alignment meetings with the PNIDD and UNODC teams, reporting on progress, challenges, and necessary adjustments to the implementation schedule.
• Deliver the finalised and approved products in the established formats (.py or .zip for codes and scripts; PDF for reports and documentation), observing the defined deadlines and quality requirements.

Qualifications/special skills
An advanced university degree (Master's degree or equivalent) in Computer Science, Computer Engineering, Software Engineering, Data Science, Artificial Intelligence, Statistics, Economics, Social Sciences, or a related field is required. A first-level university degree in a similar field, in combination with two additional years of qualifying experience, may be accepted in lieu of the advanced university degree.
• One (1) year of proven experience in structured data extraction and processing (text, PDF, images) is required.
• Experience with Python or R alongside libraries focused on data extraction (e.g., pdfminer, pyMuPDF, and pandas for Python, and tesseract, tm, and/or stringi for R) is desirable.
• Experience in applying AI, OCR, and/or NLP for text mining is desirable.
• Experience with database integration and modelling (SQL, NoSQL, APIs, etc.) is desirable.

Languages
English and French are the working languages of the United Nations Secretariat. For this position, fluency in Portuguese, with oral and written proficiency, is required. Working knowledge of English is required. Knowledge of another United Nations official language is an advantage.

Additional Information
Not available.

No Fee
THE UNITED NATIONS DOES NOT CHARGE A FEE AT ANY STAGE OF THE RECRUITMENT PROCESS (APPLICATION, INTERVIEW MEETING, PROCESSING, OR TRAINING). THE UNITED NATIONS DOES NOT CONCERN ITSELF WITH INFORMATION ON APPLICANTS' BANK ACCOUNTS.
