Language modeling and bidirectional encoder representations: an overview of key technologies

Kachkov D. I.

2020

https://doi.org/10.37661/1816-0301-2020-17-4-61-72

This paper outlines the development of the natural language processing technologies that underlie BERT (Bidirectional Encoder Representations from Transformers), a language model from Google that achieves strong results on a whole class of natural language understanding tasks. The two key ideas implemented in BERT are transfer learning and the attention mechanism. The model is pre-trained on several tasks over a large corpus of unlabeled data and can apply the language patterns it has discovered when it is fine-tuned for a specific text processing problem. The Transformer architecture it uses is based on attention, that is, it estimates the relationships between the tokens of the input. The paper also notes the strengths and weaknesses of BERT and directions for further improvement of the model.
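The abstract describes attention only in words. As a rough illustration of what "estimating the relationships between tokens" can mean, the sketch below assumes the standard scaled dot-product form, softmax(QK^T / sqrt(d_k)) V, from the Transformer paper "Attention Is All You Need". The function names and the toy vectors are hypothetical, and the learned query, key and value projections of a real Transformer are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of equal-length vectors, one per token.
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[dim] for w, v in zip(row, values))
            for dim in range(len(values[0]))
        ])
    return outputs, weights


# Hypothetical 2-dimensional embeddings for three tokens (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: no learned projections, so Q = K = V = tokens.
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a weighted average of all token vectors. This is what lets a token's representation depend on context to both its left and its right.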

Kachkov D. I. Language modeling and bidirectional encoder representations: an overview of key technologies. Informatics. 2020;17(4):61-72. https://doi.org/10.37661/1816-0301-2020-17-4-61-72

Information

Author

Kachkov D. I., Belarusian State University

Pages

61-72

Year of publication

2020

Collections

Informatics