Must deep learning use only real numbers? This article briefly surveys recent research that applies complex numbers to deep learning, and notes that complex values can bring more robust propagation of gradient information across layers, higher memory capacity, more precise forgetting behavior, drastically smaller networks, and better stability when training GANs.
The Mandelbrot set, a famous set of complex numbers: http://ift.tt/1qiNqGB
Isn't it strange that deep learning uses only real numbers? Perhaps it would be even stranger for deep learning to use complex numbers (recall that complex numbers have an imaginary part). One reasonable argument is that the brain is unlikely to compute with complex numbers. Of course, one could equally argue that the brain does not perform matrix multiplication or chain-rule differentiation either. Besides, artificial neural networks (ANNs) are only models of actual neurons. We traded biological plausibility for real analysis long ago.
So why stop at real analysis? We have leaned on linear algebra and differential equations for a long time; we could just as well tear it all down and rebuild on complex analysis. Perhaps the richer machinery of complex analysis would give us more powerful methods. After all, it works for quantum mechanics, so it might well work for deep learning too. Moreover, both deep learning and quantum mechanics are about information processing; they may even turn out to be the same thing.
For the sake of argument, let us set biological plausibility aside. It is an old debate anyway, going back to 1957, when Frank Rosenblatt first proposed artificial neural networks. The question, then, is: what can complex numbers offer that real numbers cannot?
Over the past few years, a number of papers have explored the use of complex numbers in deep learning. Curiously, most of them have not been accepted by peer-reviewed journals; the deep learning orthodoxy is already well entrenched in the field. Still, there are several interesting papers worth reviewing.
DeepMind's paper "Associative Long Short-Term Memory" (Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, Alex Graves) explores the use of complex values to build an associative memory inside a neural network. The system is used to augment the memory of an LSTM. The paper concludes that the complex-valued network attains greater memory capacity; mathematically, it gets by with smaller matrices than a purely real-valued formulation would need. The complex-valued network also differs markedly from a conventional LSTM in memory cost.
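At the heart of this kind of associative memory is a holographic-style binding operation that complex numbers make cheap: each value is bound to a random unit-modulus complex key by element-wise multiplication, all bindings are added into a single trace, and multiplying by a key's complex conjugate recovers the corresponding value plus some cross-talk noise. The NumPy sketch below is only a minimal illustration of that idea, not the paper's architecture, and every name in it is made up for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4096  # dimensionality of the memory trace

    def random_key(n):
        # A unit-modulus complex key: its conjugate is an exact element-wise
        # inverse, which is what makes retrieval from the trace possible.
        return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))

    keys = [random_key(n) for _ in range(3)]
    values = [rng.standard_normal(n) + 1j * rng.standard_normal(n) for _ in range(3)]

    # Bind each value to its key (element-wise product) and superpose the bindings.
    trace = sum(k * v for k, v in zip(keys, values))

    # Retrieval: the conjugate key undoes the binding for its own value exactly,
    # while the other two bindings remain as cross-talk noise.
    recovered = np.conj(keys[0]) * trace

    print(np.corrcoef(recovered.real, values[0].real)[0, 1])  # clearly positive
    print(np.corrcoef(recovered.real, values[1].real)[0, 1])  # near zero

Roughly speaking, the paper's actual model reduces this cross-talk by keeping several redundant copies of the trace with independently permuted keys and averaging the retrieved values, and it wires the resulting memory into an LSTM.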
Yoshua Bengio and his group in Montreal explored a different way of using complex values. In "Unitary Evolution Recurrent Neural Networks" (Martin Arjovsky, Amar Shah, Yoshua Bengio), the authors study unitary recurrence matrices, arguing that a transition matrix whose eigenvalues stay close to 1 may bring real benefits against the vanishing gradient problem. The work uses complex values for the weights of an RNN and concludes:

"Empirical evidence suggests that our uRNN is better able to pass gradient information through long sequences and does not suffer from saturating hidden states as much as LSTMs."

They take several measurements to quantify this behavior against more traditional RNNs:

[Image: quantitative comparison between the uRNN and conventional RNNs]

A system using complex values clearly shows more robust and stable behavior.

A paper from Bengio's group together with researchers at MIT, "Gated Orthogonal Recurrent Units: On Learning to Forget" (http://ift.tt/2jTLNb3) (Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljačić, Yoshua Bengio), a.k.a. GORU, extends the approach with a gating mechanism. The paper explores the possibility that long-term dependencies are captured better and that this leads to a more robust forgetting mechanism. In the following figure, they show other RNN-based systems failing on the copying task:

[Image: copying-task results for GORU versus other RNN variants]

A team at FAIR and EPFL (Cijo Jose, Moustapha Cisse, and Francois Fleuret) takes a similar route in "Kronecker Recurrent Units" (http://ift.tt/2y0VIlu), where they also use unitary matrices to show viability on the copying task, together with a matrix factorization that greatly reduces the number of parameters required. The paper describes their motivation for using complex values:

"Since the determinant is a continuous function the unitary set in real space is disconnected. Consequently, with the real-valued networks we cannot span the full unitary set using the standard continuous optimization procedures. On the contrary, the unitary set is connected in the complex space as its determinants are the points on the unit circle and we do not have this issue."

One of the gems in this paper is a very insightful architectural idea:

"the state should remain of high dimension to allow the use of high-capacity networks to encode the input into the internal state, and to extract the predicted value, but the recurrent dynamic itself can, and should, be implemented with a low-capacity model."

So far, these methods have explored the use of complex values in RNNs.
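To make the "eigenvalues close to 1" intuition concrete, here is a minimal NumPy sketch. It is not the parameterization used by uRNN, GORU, or KRU, which compose structured unitary factors; the names below are made up for illustration. A unitary transition matrix has eigenvalues of modulus exactly 1, so repeatedly applying it preserves the norm of the hidden state, whereas a generic real matrix shrinks or inflates it exponentially:

    import numpy as np

    rng = np.random.default_rng(0)
    n, steps = 64, 1000

    # Simplest possible unitary transition: a diagonal matrix of phases.
    # Every eigenvalue is exp(1j * theta_k), so its modulus is exactly 1.
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    W_unitary = np.diag(np.exp(1j * theta))

    # A generic real transition matrix with spectral radius slightly above 1.
    W_real = 1.05 * rng.standard_normal((n, n)) / np.sqrt(n)

    h_c = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # complex hidden state
    h_r = h_c.real.copy()                                        # real hidden state
    norm_start = np.linalg.norm(h_c)

    for _ in range(steps):
        h_c = W_unitary @ h_c  # norm is preserved at every step
        h_r = W_real @ h_r     # norm drifts exponentially

    print(norm_start, np.linalg.norm(h_c))  # equal up to floating-point rounding
    print(np.linalg.norm(h_r))              # exploded; with a factor below 1 it would vanish

The same norm-preservation argument applies to the backpropagated gradient, which is why these architectures can carry information across very long sequences.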
A recent paper from MILA (the Montreal Institute for Learning Algorithms), "Deep Complex Networks" (http://ift.tt/2jTLOvD) (Chiheb Trabelsi et al.), further extends the approach from RNNs to convolutional networks. The authors test their network on vision tasks, with competitive results.

Finally, we have to mention the use of complex values in GANs; after all, this seems to be the hottest topic. "Numerics of GANs" (http://ift.tt/2jTLP2F) (Lars Mescheder, Sebastian Nowozin, Andreas Geiger) explores the troublesome convergence properties of GAN training. The authors study the complex eigenvalues of the Jacobian of the training dynamics and use that analysis to build a state-of-the-art approach to the problem of GAN equilibrium.

In a post last year, I wrote about the relationship between the holographic principle and deep learning (http://ift.tt/2htJNQt). That approach explored the similarity between tensor networks and deep learning architectures. Quantum mechanics can be thought of as a more general form of probability, and complex numbers give it capabilities that ordinary probability lacks, most notably superposition and interference. If you are after holography, it is always nice to have complex numbers at your disposal.

The research papers mentioned here show that there are indeed many "real" advantages to using complex values in deep learning architectures. The research indicates more robust transmission of gradient information across layers, higher memory capacity, more precise forgetting behavior, drastically reduced network sizes for sequence models, and greater stability in GAN training. These are too many advantages to simply ignore. If we accept the present deep learning orthodoxy that any differentiable layer is fair game, then perhaps we should also make use of complex analysis, where the options in this grocery store have a lot more variety.

Perhaps one reason complex numbers are not used more often is a lack of familiarity among researchers. The mathematical heritage of the optimization community does not involve complex numbers, and there is little need for them in operational research. Physicists, on the other hand, use them all the time; imaginary numbers keep popping up in quantum mechanics. That is not weird, it simply reflects reality. We still have little understanding of why these deep learning systems work so well, so seeking out alternative formulations could lead to unexpected breakthroughs. This is the game we play today: the team that accidentally stumbles on the AGI breakthrough wins the entire pot!

In the near future, the tables may turn. The use of complex values may become commonplace in SOTA architectures, and their absence may come to seem odd.
Original article: http://ift.tt/2xpvNmC