There are numerous differences between audio and video; that hardly requires saying. I would like to draw your attention to a subset of those differences, and then to one particularly significant difference within it. The subset is how humans perceive audio and video 'data'.
When a typical person looks carefully at an image, he can extract all the information it contains straight away: the objects and their relative positions, relative sizes, colors (if color information is present), and motion (in some circumstances). If there is noise in the image, looking longer reveals little or no additional information. Likewise, when a typical person listens to audio, he can extract much of the same kind of information (except color and other visual-only attributes).
However, if the audio is noisy, additional listening can reveal more information. Humans can listen through noise, including interfering speech, by focusing their attention and adaptively 'filtering' out the noise. This is a very significant difference: repeated listening pays off in a way that repeated looking does not. It makes itself known daily to analysts, transcribers, forensic examiners, detectives, reporters, and other professionals who deal with audio and video.