References
Datasets
- Maritaca AI. OAB Bench — Benchmark for evaluating models on open-ended (essay) questions from the Brazilian Bar Exam (OAB).
- Repository: https://github.com/maritaca-ai/oab-bench
- Paper: https://dl.acm.org/doi/pdf/10.1145/3769126.3769227
- Garcia, E. OAB Exams — Objective multiple-choice questions from the 1st phase of the Brazilian Bar Exam (OAB), 2010–2018.
- HuggingFace: https://huggingface.co/datasets/eduagarcia/oab_exams
Tools and libraries
- Ollama — Local execution of LLMs. https://ollama.com/
- MiniJinja — Jinja2-compatible template engine (with Python bindings). https://github.com/mitsuhiko/minijinja
- scikit-learn — Classification metrics (Precision, Recall, F1). https://scikit-learn.org/
- HuggingFace Evaluate — Unified framework for NLP metrics. https://huggingface.co/docs/evaluate/
- ROUGE Score — Implementation of the ROUGE metric. https://github.com/google-research/google-research/tree/master/rouge
- BERTScore — Semantic similarity via contextual embeddings. https://github.com/Tiiiger/bert_score
- Matplotlib — Plot generation. https://matplotlib.org/
- Pandas — Tabular data manipulation. https://pandas.pydata.org/
Articles and guides
- Databricks. Best Practices and Methods for LLM Evaluation.
- Confident AI. LLM Evaluation Metrics: Everything You Need for LLM Evaluation.
- Databricks. LLM Auto-Eval Best Practices for RAG.
- Zhao, H. et al. LLM Evaluation: A Comprehensive Survey. arXiv, 2025.
- IBM. Tokenization: what it is and how it works.