Opportunity: Building on recent breakthroughs in NLP with Transformer, BERT, and GPT models [GG2020], we aim to advance the translation of tax code into ErgoAI (Prolog). There has been prior work applying BERT-based models to statutory reasoning and Prolog representations of tax law [HBD2020]. We will extend the depth and breadth of applying Transformer, BERT, and GPT models to translating tax code into ErgoAI (Prolog).
The U.S. tax code and the Connecticut State tax code are both available online in structured formats. We have already captured all of the U.S. and Connecticut State tax code and are preparing to translate it into ErgoAI (Prolog). This is challenging because of the complexity of both English and tax law.
There has been recent work on translating English text into Prolog rules. Some of this work uses a Controlled Natural Language (CNL) [GFK2018, KD2022]; in other cases it has been done without any CNL [WBGFK2022]. Unfortunately, recasting the tax code in a CNL appears to be about as hard as translating it directly into ErgoAI (Prolog). We want to automate this process with Transformer models [VSPUJGKP2017], informed by compiler-based treatments of tax law [MMP2021, MCP2021]. ErgoAI was developed and is maintained by Coherent Knowledge.
Transformer Models: In the last year, with Krutika Patel and the generous support of the Yale University Center for Research Computing CAREERS program, we found several patterns in legal text. These patterns allowed us to selectively translate small sections of the law into logic. To make this practical, however, the translation must scale considerably. Our hope is that Transformer-based models will let us scale the translation of U.S. and Connecticut tax code into ErgoAI (Prolog) well enough to significantly augment hand translation.
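As an illustrative sketch (not an implemented result), the snippet below shows how a sequence-to-sequence transformer from the Hugging Face transformers library could be asked to map a tax-code sentence to an ErgoAI/Prolog-style rule. The checkpoint name, task prefix, example clause, and target rule are all placeholders; a model fine-tuned on paired legal text and rules would be needed for useful output.

```python
# Hypothetical sketch: mapping a tax-code sentence to an ErgoAI/Prolog-style rule
# with a sequence-to-sequence transformer. "t5-small" is only a placeholder;
# a checkpoint fine-tuned on (tax text, rule) pairs is assumed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "t5-small"  # placeholder for a fine-tuned legal-to-logic checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

clause = ("A resident individual whose Connecticut adjusted gross income "
          "exceeds the threshold shall file a return.")

# Encode the clause, generate a candidate rule, and decode it back to text.
inputs = tokenizer("translate legal text to logic: " + clause, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# With a suitably fine-tuned model, the hoped-for output would look like:
#   must_file_return(Person) :- resident(Person), agi(Person, AGI), AGI > threshold.
```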
Project Information Subsection
We expect to deliver transformer-based translation systems. These systems will first translate the Connecticut tax code into ErgoAI (Prolog). The Connecticut tax code contains fewer than 1,500 regulations.
Once we understand the Connecticut tax code translation, we will refine the transformer-based tools on the U.S. tax code. The U.S. tax code is much larger than the Connecticut tax code, so its automatic translation matters even more.
While building the tax transformer models, we will author a research paper that surveys the opportunities and outlines our findings.
We will post all deliverables to public GitHub repositories.
Selected References:
[GFK2018] Tiantian Gao, Paul Fodor, and Michael Kifer. "Knowledge Authoring for Rule-Based Reasoning." In OTM Conferences (ODBASE), 2018: 461-480.
[GG2020] Benyamin Ghojogh and Ali Ghodsi. "Attention mechanism, transformers, BERT, and GPT: Tutorial and survey." (2020).
[HBD2020] Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. "A dataset for statutory reasoning in tax law entailment and question answering." arXiv preprint arXiv:2005.05257 (2020).
[KD2022] Robert Kowalski and Akber Datoo. "Logical English meets legal English for swaps and derivatives." Artificial Intelligence and Law 30, no. 2 (2022): 163-197. https://doi.org/10.1007/s10506-021-09295-3
[MMP2021] Denis Merigoux, Raphaël Monat, and Jonathan Protzenko. "A modern compiler for the French tax code." In Proceedings of the 30th ACM SIGPLAN International Conference on Compiler Construction, pp. 71-82. 2021.
[MCP2021] Denis Merigoux, Nicolas Chataing, and Jonathan Protzenko. "Catala: a programming language for the law." Proceedings of the ACM on Programming Languages 5, no. ICFP (2021): 1-29.
[TWW] Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Natural language processing with transformers. O'Reilly Media, Inc., 2022.
[WBGFK2022] Yuheng Wang, Giorgian Borca-Tasciuc, Nikhil Goel, Paul Fodor, and Michael Kifer. "Knowledge Authoring with Factual English." arXiv preprint arXiv:2208.03094 (2022). https://arxiv.org/abs/2208.03094v1
[VSPUJGKP2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
{Empty}
Ideally, the student should know Python or similar programming languages.
They should have an interest in NLP (Natural Language Processing) and its applications.
They should also be able to learn to work with one of several Python NLP libraries, such as NLTK (https://realpython.com/nltk-nlp-python/).
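For example, a minimal NLTK session (illustrative only; the sample text is invented) might look like this:

```python
# Minimal NLTK example: sentence- and word-tokenizing a fragment of tax-code-like
# text before any translation step.
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK releases may also need "punkt_tab")

text = ("Each resident of this state shall file a return. "
        "The return is due on or before April fifteenth.")

for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))
```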
{Empty}
Some hands-on experience
{Empty}
University of Connecticut - Stamford
Stamford, Connecticut
CR-Yale
01/09/2023
No
Already behind. Start date is flexible.
6
{Empty}
{Empty}
{Empty}
{Empty}
Milestone Title: Connecticut Tax Code paired with Logic outputs Milestone Description: General comment about these milestones: We do not expect to get ideal translations from tax code to logic (Prolog/ErgoAI).
We have Python scripts that pull the Connecticut Tax Code (CTC). We will pair selected CTC inputs with their desired Prolog/ErgoAI outputs for training and determine the quality of our translations on the CTC. Completion Date Goal: 2023-02-10
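A possible shape for this pairing step, with illustrative field names and an invented example rule, is sketched below:

```python
# Hedged sketch of the pairing step: store selected CTC passages with their
# hand-written ErgoAI/Prolog targets as JSON Lines for later model training.
# The field names, citation, and rule are illustrative assumptions.
import json

pairs = [
    {
        "section": "Conn. Gen. Stat. § 12-700 (illustrative)",
        "text": "A tax is hereby imposed on the Connecticut taxable income of each resident.",
        "ergoai": "income_tax_applies(Person) :- resident(Person, connecticut).",
    },
]

with open("ctc_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```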
Milestone Title: Modeling intermediate representation using CTC and Logic Milestone Description: Encoder/decoder mapping between the CTC and ErgoAI/Prolog, using integer-vector representations of the intermediate values. Completion Date Goal: 2023-03-10
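As a sketch of what "integer-vector representation" means here, a BERT tokenizer can map both the CTC text and the target rule to sequences of integer IDs that an encoder/decoder model consumes; the checkpoint name and example strings are assumptions.

```python
# Sketch of the integer-vector intermediate representation: a BERT tokenizer
# turns source text and target rule into integer ID sequences.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # assumed checkpoint

source = "A tax is hereby imposed on the Connecticut taxable income of each resident."
target = "income_tax_applies(Person) :- resident(Person, connecticut)."

src_ids = tokenizer(source).input_ids   # integer vector for the encoder
tgt_ids = tokenizer(target).input_ids   # integer vector for the decoder
print(src_ids[:10], tgt_ids[:10])
```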
Milestone Title: Using BERT (Bidirectional Encoder Representations from Transformers) to try to improve mapping CTC to logic Milestone Description: Full BERT models: see whether we can fine-tune them using the previous two deliverables. Completion Date Goal: 2023-04-14
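One plausible setup, sketched below with assumed checkpoint names and a single illustrative training step, pairs two BERT checkpoints into an encoder/decoder model that could then be fine-tuned on the (CTC text, ErgoAI rule) pairs from the earlier milestones:

```python
# Hedged sketch: a BERT-to-BERT encoder/decoder that could be fine-tuned on
# (CTC text, ErgoAI rule) pairs. Checkpoints and strings are assumptions.
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

source = "A tax is hereby imposed on the Connecticut taxable income of each resident."
target = "income_tax_applies(Person) :- resident(Person, connecticut)."

enc = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One illustrative training step; a real run would loop over the full pair set.
loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask,
             labels=labels).loss
loss.backward()
print(float(loss))
```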
Milestone Title: U.S. Federal Tax Code paired with Logic outputs Milestone Description: We have Python scripts that pull the Federal Tax Code (FTC). We will pair selected FTC inputs with their desired Prolog/ErgoAI outputs for training, determine the quality of our translations on the FTC, and build on what we learned from the CTC. Completion Date Goal: 2023-05-12
Milestone Title: Modeling intermediate representation using FTC and Logic Milestone Description: Encoder/decoder mapping between the FTC and ErgoAI/Prolog, using integer-vector representations of the intermediate values. We will build on what we learned from the simpler and much smaller CTC. Completion Date Goal: 2023-06-16
Milestone Title: Using BERT to try to improve mapping FTC to logic Milestone Description: Full BERT models: see whether we can fine-tune them using the previous two deliverables and build on the CTC work. Completion Date Goal: 2023-07-14
{Empty}
{Empty}
{Empty}
The student will learn about transformer models such as BERT.
The student will learn about ErgoAI.
The student will learn some data architecture.
The student will learn to organize the legal text for storage so that retrieval and mapping to either ErgoAI or Catala-lang are easy.
This transformed/organized tax code will be stored in a relational database such as MySQL.
The student will learn SQL and how to interact with a relational database through a database workbench.
The student will learn how to work with a relational database from a language such as Python.
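A minimal sketch of that workflow is shown below, using Python's built-in sqlite3 module as a stand-in for MySQL; the schema, column names, and sample row are illustrative assumptions, and with MySQL the same SQL would run through a driver such as mysql-connector-python.

```python
# Hedged sketch: storing organized tax-code sections in a relational database
# and querying them from Python. sqlite3 stands in for MySQL here.
import sqlite3

conn = sqlite3.connect("tax_code.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sections (
        id INTEGER PRIMARY KEY,
        jurisdiction TEXT,      -- 'CT' or 'US'
        citation TEXT,          -- e.g. 'Conn. Gen. Stat. § 12-700'
        body TEXT,              -- the statutory text
        ergoai TEXT             -- translated rule, when available
    )
""")
conn.execute(
    "INSERT INTO sections (jurisdiction, citation, body, ergoai) VALUES (?, ?, ?, ?)",
    ("CT", "Conn. Gen. Stat. § 12-700 (illustrative)",
     "A tax is hereby imposed on the Connecticut taxable income of each resident.",
     None),
)
conn.commit()

# Retrieve all Connecticut sections that still lack an ErgoAI translation.
for row in conn.execute(
        "SELECT citation FROM sections WHERE jurisdiction = ? AND ergoai IS NULL", ("CT",)):
    print(row[0])
conn.close()
```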