Played audio frames included in first audio content may be received over one or more networks. The first audio content may further include a replaced audio frame. The first audio content may correspond to video content that includes video of a face of a person as the person utters speech that is captured in the first audio content. Location data may also be received over the one or more networks. The location data may indicate locations of facial features of the face of the person in a video frame of the video content. The video frame may correspond to the replaced audio frame. Audio output may be generated that approximates a portion of the speech corresponding to the replaced audio frame. The audio output may be inserted into a replacement audio frame. Second audio content may be played including the played audio frames and the replacement audio frame.
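The flow described above may be sketched as follows. This is a minimal illustration only, not the disclosed implementation. The names `AudioFrame`, `mouth_openness`, `synthesize_frame`, and `build_second_audio`, the landmark keys, the 16 kHz sample rate, and the 20 ms frame size are all hypothetical. The tone generator is a simple stand-in for whatever speech-approximating model would actually map facial-feature locations to audio output.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SAMPLE_RATE = 16000   # assumed sample rate (Hz)
FRAME_SAMPLES = 320   # assumed frame size: 20 ms at 16 kHz

Point = Tuple[float, float]


@dataclass
class AudioFrame:
    index: int
    samples: Optional[List[float]]  # None marks a frame that must be replaced


def mouth_openness(landmarks: Dict[str, Point]) -> float:
    """Hypothetical feature: vertical lip gap divided by mouth width."""
    width = math.dist(landmarks["mouth_left"], landmarks["mouth_right"])
    gap = math.dist(landmarks["upper_lip"], landmarks["lower_lip"])
    return gap / width if width > 0 else 0.0


def synthesize_frame(landmarks: Dict[str, Point], index: int) -> AudioFrame:
    """Stand-in generator: a 150 Hz tone whose amplitude follows mouth openness."""
    amplitude = min(1.0, mouth_openness(landmarks))
    samples = [
        amplitude * math.sin(2 * math.pi * 150.0 * n / SAMPLE_RATE)
        for n in range(FRAME_SAMPLES)
    ]
    return AudioFrame(index, samples)


def build_second_audio(
    first_audio: List[AudioFrame],
    location_data: Dict[int, Dict[str, Point]],
) -> List[AudioFrame]:
    """Keep received frames; swap each missing frame for a generated one.

    location_data maps a frame index to the facial-feature locations taken
    from the video frame that corresponds to that audio frame. A missing
    frame with no location data is passed through unchanged.
    """
    second_audio = []
    for frame in first_audio:
        if frame.samples is None and frame.index in location_data:
            second_audio.append(synthesize_frame(location_data[frame.index], frame.index))
        else:
            second_audio.append(frame)
    return second_audio
```

In this sketch, a caller would pass the received frames together with the per-frame location data and hand the returned list to the audio player as the second audio content.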