Opportunity: Building on recent breakthroughs in NLP with Transformer, BERT, and GPT models [GG2020], we aim to advance the translation of tax code into ErgoAI (Prolog). There has been prior work applying BERT-based models to statutory reasoning and Prolog representations of tax law [HBD2020]. We will extend the depth and breadth of applying Transformer, BERT, and GPT models to translating tax code into ErgoAI (Prolog).
The U.S. tax code and the Connecticut State tax code are both available online in structured formats. We have already captured all of the U.S. and Connecticut State tax code and are preparing to translate it into ErgoAI (Prolog). This is challenging because of the complexity of both English and tax law.
There has been recent work on translating English text into Prolog rules. Some of this work uses a Controlled Natural Language (CNL) [GFK2018, KD2022]; in other cases it has been done without any CNL [WBGFK2022]. Unfortunately, recasting the tax code in a CNL appears to be about as hard as translating it directly into ErgoAI (Prolog). We want to automate this process with Transformer models [VSPUJGKP2017], informed by compiler-based treatments of tax law [MMP2021, MCP2021]. ErgoAI was developed and is maintained by Coherent Knowledge.
Transformer Models: In the last year, with Krutika Patel and the generous support of the Yale University Center for Research Computing CAREERS program, we found several patterns in legal text. These patterns allowed us to selectively translate small sections of the law into logic. To make this practical, however, the translation must scale considerably. Our hope is that Transformer-based models will let us scale the translation of U.S. and Connecticut tax code into ErgoAI (Prolog) well enough to significantly augment hand translation.
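As an illustrative sketch (not an implemented result), the snippet below shows how a sequence-to-sequence transformer from the Hugging Face transformers library could be asked to map a tax-code sentence to an ErgoAI/Prolog-style rule. The checkpoint name, task prefix, example clause, and target rule are all placeholders; a model fine-tuned on paired legal text and rules would be needed for useful output.

```python
# Hypothetical sketch: mapping a tax-code sentence to an ErgoAI/Prolog-style rule
# with a sequence-to-sequence transformer. "t5-small" is only a placeholder;
# a checkpoint fine-tuned on (tax text, rule) pairs is assumed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "t5-small"  # placeholder for a fine-tuned legal-to-logic checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

clause = ("A resident individual whose Connecticut adjusted gross income "
          "exceeds the threshold shall file a return.")

# Encode the clause, generate a candidate rule, and decode it back to text.
inputs = tokenizer("translate legal text to logic: " + clause, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# With a suitably fine-tuned model, the hoped-for output would look like:
#   must_file_return(Person) :- resident(Person), agi(Person, AGI), AGI > threshold.
```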
Project Information Subsection
We expect to deliver transformer-based translation systems. These systems will first translate the Connecticut tax code into ErgoAI (Prolog). The Connecticut tax code contains fewer than 1,500 regulations.
Once we understand the Connecticut tax code translation, we will refine the transformer-based tools on the U.S. tax code. The U.S. tax code is much larger than the Connecticut tax code, so its automatic translation matters even more.
While building the tax transformer models, we will author a research paper that surveys the opportunities and outlines our findings.
We will post all deliverables to public GitHub repositories.
Selected References:
[GFK2018] Tiantian Gao, Paul Fodor, and Michael Kifer. "Knowledge Authoring for Rule-Based Reasoning." In OTM Conferences (ODBASE), 2018: 461-480.
[GG2020] Benyamin Ghojogh and Ali Ghodsi. "Attention mechanism, transformers, BERT, and GPT: Tutorial and survey." (2020).
[HBD2020] Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. "A dataset for statutory reasoning in tax law entailment and question answering." arXiv preprint arXiv:2005.05257 (2020).
[KD2022] Robert Kowalski and Akber Datoo. "Logical English meets legal English for swaps and derivatives." Artificial Intelligence and Law 30, no. 2 (2022): 163-197. https://doi.org/10.1007/s10506-021-09295-3
[MMP2021] Denis Merigoux, Raphaël Monat, and Jonathan Protzenko. "A modern compiler for the French tax code." In Proceedings of the 30th ACM SIGPLAN International Conference on Compiler Construction, pp. 71-82. 2021.
[MCP2021] Denis Merigoux, Nicolas Chataing, and Jonathan Protzenko. "Catala: a programming language for the law." Proceedings of the ACM on Programming Languages 5, no. ICFP (2021): 1-29.
[TWW] Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Natural language processing with transformers. O'Reilly Media, Inc., 2022.
[WBGFK2022] Yuheng Wang, Giorgian Borca-Tasciuc, Nikhil Goel, Paul Fodor, and Michael Kifer. "Knowledge Authoring with Factual English." arXiv preprint arXiv:2208.03094 (2022). https://arxiv.org/abs/2208.03094v1
[VSPUJGKP2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
{Empty}
Ideally, the student should know Python or similar programming languages.
They should have an interest in NLP (Natural Language Processing) and its applications.
They should also be able to learn to work with one of several Python NLP libraries, such as NLTK (https://realpython.com/nltk-nlp-python/).
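For example, a minimal NLTK session (illustrative only; the sample text is invented) might look like this:

```python
# Minimal NLTK example: sentence- and word-tokenizing a fragment of tax-code-like
# text before any translation step.
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK releases may also need "punkt_tab")

text = ("Each resident of this state shall file a return. "
        "The return is due on or before April fifteenth.")

for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))
```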
{Empty}
Some hands-on experience
{Empty}
University of Connecticut - Stamford
Stamford, Connecticut
CR-Yale
01/09/2023
No
Already behind. Start date is flexible.
6
{Empty}
{Empty}
{Empty}
{Empty}
Milestone Title: Connecticut Tax Code paired with Logic outputs Milestone Description: General comment about these milestones: We do not expect to get ideal translations from tax code to logic (Prolog/ErgoAI).
We have Python scripts that pull the Connecticut Tax Code (CTC). We will pair selected CTC inputs with their desired Prolog/ErgoAI outputs for training and determine the quality of our translations on the CTC. Completion Date Goal: 2023-02-10
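A possible shape for this pairing step, with illustrative field names and an invented example rule, is sketched below:

```python
# Hedged sketch of the pairing step: store selected CTC passages with their
# hand-written ErgoAI/Prolog targets as JSON Lines for later model training.
# The field names, citation, and rule are illustrative assumptions.
import json

pairs = [
    {
        "section": "Conn. Gen. Stat. § 12-700 (illustrative)",
        "text": "A tax is hereby imposed on the Connecticut taxable income of each resident.",
        "ergoai": "income_tax_applies(Person) :- resident(Person, connecticut).",
    },
]

with open("ctc_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```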
Milestone Title: Modeling intermediate representation using CTC and Logic Milestone Description: Encoder/decoder mapping between the CTC and ErgoAI/Prolog, using integer-vector representations of the intermediate values. Completion Date Goal: 2023-03-10
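As a sketch of what "integer-vector representation" means here, a BERT tokenizer can map both the CTC text and the target rule to sequences of integer IDs that an encoder/decoder model consumes; the checkpoint name and example strings are assumptions.

```python
# Sketch of the integer-vector intermediate representation: a BERT tokenizer
# turns source text and target rule into integer ID sequences.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # assumed checkpoint

source = "A tax is hereby imposed on the Connecticut taxable income of each resident."
target = "income_tax_applies(Person) :- resident(Person, connecticut)."

src_ids = tokenizer(source).input_ids   # integer vector for the encoder
tgt_ids = tokenizer(target).input_ids   # integer vector for the decoder
print(src_ids[:10], tgt_ids[:10])
```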
Milestone Title: Using BERT (Bidirectional Encoder Representations from Transformers) to try to improve mapping CTC to logic Milestone Description: Full BERT models: see whether we can fine-tune them using the previous two deliverables. Completion Date Goal: 2023-04-14
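One plausible setup, sketched below with assumed checkpoint names and a single illustrative training step, pairs two BERT checkpoints into an encoder/decoder model that could then be fine-tuned on the (CTC text, ErgoAI rule) pairs from the earlier milestones:

```python
# Hedged sketch: a BERT-to-BERT encoder/decoder that could be fine-tuned on
# (CTC text, ErgoAI rule) pairs. Checkpoints and strings are assumptions.
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

source = "A tax is hereby imposed on the Connecticut taxable income of each resident."
target = "income_tax_applies(Person) :- resident(Person, connecticut)."

enc = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One illustrative training step; a real run would loop over the full pair set.
loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask,
             labels=labels).loss
loss.backward()
print(float(loss))
```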
Milestone Title: U.S. Federal Tax Code paired with Logic outputs Milestone Description: We have Python scripts that pull the Federal Tax Code (FTC). We will pair selected FTC inputs with their desired Prolog/ErgoAI outputs for training, determine the quality of our translations on the FTC, and build on what we learned from the CTC. Completion Date Goal: 2023-05-12
Milestone Title: Modeling intermediate representation using FTC and Logic Milestone Description: Encoder/decoder mapping between the FTC and ErgoAI/Prolog, using integer-vector representations of the intermediate values. We will build on what we learned from the simpler and much smaller CTC. Completion Date Goal: 2023-06-16
Milestone Title: Using BERT to try to improve mapping FTC to logic Milestone Description: Full BERT models: see whether we can fine-tune them using the previous two deliverables and build on the CTC work. Completion Date Goal: 2023-07-14
{Empty}
{Empty}
{Empty}
The student will learn about transformer models such as BERT.
The student will learn about ErgoAI.
The student will learn some data architecture.
The student will learn to organize the legal text for storage so that retrieval and mapping to either ErgoAI or Catala-lang are easy.
This transformed/organized tax code will be stored in a relational database such as MySQL.
The student will learn SQL and how to interact with a relational database through a database workbench.
The student will learn how to work with a relational database from a language such as Python.
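A minimal sketch of that workflow is shown below, using Python's built-in sqlite3 module as a stand-in for MySQL; the schema, column names, and sample row are illustrative assumptions, and with MySQL the same SQL would run through a driver such as mysql-connector-python.

```python
# Hedged sketch: storing organized tax-code sections in a relational database
# and querying them from Python. sqlite3 stands in for MySQL here.
import sqlite3

conn = sqlite3.connect("tax_code.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sections (
        id INTEGER PRIMARY KEY,
        jurisdiction TEXT,      -- 'CT' or 'US'
        citation TEXT,          -- e.g. 'Conn. Gen. Stat. § 12-700'
        body TEXT,              -- the statutory text
        ergoai TEXT             -- translated rule, when available
    )
""")
conn.execute(
    "INSERT INTO sections (jurisdiction, citation, body, ergoai) VALUES (?, ?, ?, ?)",
    ("CT", "Conn. Gen. Stat. § 12-700 (illustrative)",
     "A tax is hereby imposed on the Connecticut taxable income of each resident.",
     None),
)
conn.commit()

# Retrieve all Connecticut sections that still lack an ErgoAI translation.
for row in conn.execute(
        "SELECT citation FROM sections WHERE jurisdiction = ? AND ergoai IS NULL", ("CT",)):
    print(row[0])
conn.close()
```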